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REMARKS 

Status of the Claims 

Claims 1, 4-6, 9, 10, 12, 13, 15-17, 19, 20 and 57-63 are pending in the application. 
Claims 13, 15, 19, 20 and 57 are withdrawn from consideration. Claims 13, 15, and 17 have 
been amended to further clarify the claimed subject matter. No new matter is added by these 
amendments. Applicants reserve the right to prosecute non-elected subject matter in subsequent 
divisional applications. 

Comments Regarding Restriction Requirement 

Applicants affirm the election with traverse of Group I, which corresponds to claims 1, 4- 
6, 9-10, 12, 16-17, 58-63, drawn to polynucleotides. 

Applicants reiterate the request that the Examiner withdraw the Restriction Requirement 
at least with respect to claims 13, 15, 19-20, and 57 of Group II, and examine those claims 
together with the elected polynucleotide claims of Group I. 

The rules under MPEP section 1893.03(d) require the Examiner to apply the Unity of 
Invention standard PCT Rule 13.2 instead of U.S. restriction/election of species practice in 
national stage applications, such as the instant application filed under 35 U.S.C. 371. Applicants 
believe unity of invention exists for claims drawn to the elected polynucleotide sequence of SEQ 
ID NO:4 (i.e., claim 1) and claims drawn to the polypeptide sequence encoded by the 
polynucleotide sequence of SEQ ID NO:4 (i.e., claim 13) based on the rules concerning unity of 
invention under the Patent Cooperation Treaty. Further, claim 13 has been placed in independent 
form. Therefore, Applicants, request that the Examiner withdraw the Restriction Requirement, at 
least with respect to claim 13 of Group II, and examine those claims together with the elected 
polynucleotide claims of Group I. 

Rejoinder of method claims upon allowance of product claims under U.S. practice 
The Examiner is respectfully reminded that claims 19, 20, and 57, directed to methods of 
using the claimed polynucleotides, are entitled to rejoinder upon allowance of a product claim 
per the Commissioner's Notice in the Official Gazette of March 26, 1996, entitled "Guidance on 
Treatment of Product and Process Claims in light of In re Ochiai, In re Brouwer and 35 U.S.C. § 
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103(b)" which sets forth the rules, upon allowance of a product claim, for rejoinder of process 

claims covering the same scope of products. Likewise, claim 15, which is directed to a method 
of using the polypeptide of claim 13, should be rejoined upon allowance of product claim 13. 

Priority Claim 

As suggested by the Examiner, the paragraph following the title is now updated. 

New Matter Rejection 

Claims 58-63 were rejected under 35 U.S.C. § 112 because of alleged new matter. 
Applicants point out that support for these claims is provided, for example, at page 3, line 19, 
page 11, line 26, page 14, lines 21-23, page 8, lines 40-41, page 1 1 lines 8-10, page 24, lines 6- 
14, and page 48, lines 15-17 of the Specification. Withdrawal of this rejection is therefore 
respectfully requested. 

Utility rejection under 35 U.S.C. § 101 

Claims 1, 4-6, 9-10, 12, 16-17, 58-63 were rejected under 35 U.S.C. § 101 because the 
claimed invention is not supported by either a specific asserted utility or a well established utility. 
This rejection is respectfully traversed. 

The rejection of claims 1, 4-6, 9-10, 12, 16-17, 58-63 is improper, as the inventions of 
those claims have a patentable utility as set forth in the instant specification, and/or a utility well 
known to one of ordinary skill in the art. 

The invention at issue is a polynucleotide sequence corresponding to a gene that is 
expressed in humans. Similarities between SEQ ID NO:4 and Mus musculus Impact (g4038076) 
are described in Table 1. As such, the claimed invention has numerous practical, beneficial uses 
in toxicology testing, drug development, and the diagnosis of disease, none of which requires 
knowledge of how the polypeptide coded for by the polynucleotide actually functions. 

Applicants submit with this paper the Declaration of Dr. Tod Bedilion 1 describing some 



*The Bedilion Declaration is submitted herewith in unexecuted form. The executed 
Declaration will be submitted to the Patent office as soon as it is available. 
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of the practical uses of the claimed invention in gene and protein expression monitoring 

applications. The Bedilion Declaration demonstrates that the positions and arguments made by 
the Patent Examiner with respect to the utility of the claimed polynucleotide are without merit. 

Note that the instant application is the National Stage of International Application No. 
PCT/US00/15344, filed June 1, 2000, which claims the benefit under 35 U.S.C. § 119(e) of 
provisional application U.S. Ser. No. 60/147,542, filed August 5, 1999 (hereinafter 'the Hodgson 
'542 application). The Hodgson '542 application contains the same disclosure with respect to the 
claimed invention as the Hodgson '416 application. For the sake of convenience, Applicants cite 
to and discuss the Hodgson '416 specification below on the understanding that the descriptions in 
that specification have the August 5, 1999 priority date of the Hodgson '542 application. 
Applicants will provide the page and line numbers to indicate the equivalent citations within the 
specification of the Hodgson '542 application if the Examiner so requests. 

The Bedilion Declaration describes, in particular, how the claimed expressed 

polynucleotide can be used in gene expression monitoring applications that were well-known at 

the time the patent application was filed, and how those applications are useful in developing 

drugs and monitoring their activity. Dr. Bedilion states that the claimed invention is a useful tool 

when employed as a highly specific probe in a cDNA microarray: 

Persons skilled in the art would appreciate that cDNA microarrays that contained the SEQ 
ID NO:4-encoding polynucleotides would be a more useful tool than cDNA microarrays 
that did not contain the polynucleotides in connection with conducting gene expression 
monitoring studies on proposed (or actual) drugs for treating autoimmune/inflammatory 
disorders and cell proliferative disorders, including cancer, for such purposes as 
evaluating their efficacy and toxicity. 

The Patent Examiner does not dispute that the claimed polynucleotide can be used as a 
probe in cDNA microarrays and used in gene expression monitoring applications. Instead, the 
Patent Examiner contends that the claimed polynucleotide cannot be useful without precise 
knowledge of its biological function. But the law never has required knowledge of biological 
function to prove utility. It is the claimed invention's uses, not its functions, that are the subject 
of a proper analysis under the utility requirement. 

In any event, as demonstrated by the Bedilion Declaration, the person of ordinary skill in 
the art can achieve beneficial results from the claimed polynucleotide in the absence of any 
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knowledge as to the precise function of the protein encoded by it. The uses of the claimed 

polynucleotide in gene expression monitoring applications are in fact independent of its precise 
function. 

I. The Applicable Legal Standard 

To meet the utility requirement of sections 101 and 1 12 of the Patent Act, the patent 

applicant need only show that the claimed invention is "practically useful," Anderson v. Natta, 

480 F.2d 1392, 1397, 178 USPQ 458 (CCPA 1973) and confers a "specific benefit" on the 

public. Brenner v. Manson, 383 U.S. 519, 534-35, 148 USPQ 689 (1966). As discussed in a 

recent Court of Appeals for the Federal Circuit case, this threshold is not high: 

An invention is "useful" under section 101 if it is capable of providing some identifiable 
benefit. See Brenner v. Manson, 383 U.S. 519, 534 [148 USPQ 689] (1966); Brooktree 
Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 [24 USPQ2d 1401] (Fed. 
Cir. 1992) ("to violate Section 101 the claimed device must be totally incapable of 
achieving a useful result"); Fuller v. Berger, 120 F. 274, 275 (7th Cir. 1903) (test for 
utility is whether invention "is incapable of serving any beneficial end"). 

Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. Cir. 1999). 

While an asserted utility must be described with specificity, the patent applicant need not 

demonstrate utility to a certainty. In Stiftung v. Renishaw PLC, 945 F.2d 1173, 1180, 

20 USPQ2d 1094 (Fed. Cir. 1991), the United States Court of Appeals for the Federal Circuit 

explained: 

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding 
lack of utility." Envirotech Corp. v. Al George, Inc., 730 F.2d 753, 762, 221 USPQ 473, 
480 (Fed. Cir. 1984). 

The specificity requirement is not, therefore, an onerous one. If the asserted utility is 
described so that a person of ordinary skill in the art would understand how to use the claimed 
invention, it is sufficiently specific. See Standard Oil Co. v. Montedison, S.p.a., 212 U.S.P.Q. 
327, 343 (3d Cir. 1981). The specificity requirement is met unless the asserted utility amounts to 
a "nebulous expression" such as "biological activity" or "biological properties" that does not 
convey meaningful information about the utility of what is being claimed. Cross v. Iizuka, 
753 F.2d 1040, 1048 (Fed. Cir. 1985). 
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In addition to conferring a specific benefit on the public, the benefit must also be 

"substantial." Brenner, 383 U.S. at 534. A "substantial" utility is a practical, "real-world" 
utility. Nelson v. Bowler, 626 F.2d 853, 856, 206 USPQ 881 (CCPA 1980). 

If persons of ordinary skill in the art would understand that there is a "well-established" 
utility for the claimed invention, the threshold is met automatically and the applicant need not 
make any showing to demonstrate utility. Manual of Patent Examination Procedure at 
§ 706.03(a). Only if there is no "well-established" utility for the claimed invention must the 
applicant demonstrate the practical benefits of the invention. Id. 

Once the patent applicant identifies a specific utility, the claimed invention is presumed 
to possess it. In re Cortright, 165 F.3d 1353, 1357, 49 USPQ2d 1464 (Fed. Cir. 1999); In re 
Brana, 51 F.3d 1560, 1566; 34 USPQ2d 1436 (Fed. Cir. 1995). In that case, the Patent Office 
bears the burden of demonstrating that a person of ordinary skill in the art would reasonably 
doubt that the asserted utility could be achieved by the claimed invention. Id. To do so, the 
Patent Office must provide evidence or sound scientific reasoning. See In re Longer, 503 F.2d 
1381), l jyi-y2, 183 U5*V z86 (ttrA iy/<*). i± anu uiuy n uic raicm umw suwi « 

showing, the burden shifts to the applicant to provide rebuttal evidence that would convince the 
person of ordinary skill that there is sufficient proof of utility. Brana, 51 F.3d at 1566. The 
applicant need only prove a "substantial likelihood" of utility; certainty is not required. Brenner, 
383 U.S. at 532. 

II. Toxicology testing, drug discovery, and disease diagnosis are sufficient utilities 
under 35 U.S.C. §§ 101 and 112, first paragraph 

The claimed invention meets all of the necessary requirements for establishing a credible 
utility under the Patent Law: There are "well-established" uses for the claimed invention known 
to persons of ordinary skill in the art, and there are specific practical and beneficial uses for the 
invention disclosed in the patent application's specification. These uses are explained, in detail, 
in the Bedilion Declaration accompanying this response. Objective evidence, not considered by 
the Patent Office, further corroborates the credibility of the asserted utilities. 

A. The use of SEQ D3 NO:4 polynucleotides for toxicol gy testing, drug 
disc very, and disease diagnosis are practical uses that confer "specific 
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benefits" to the public 

The claimed invention has specific, substantial, real-world utility by virtue of its use in 
toxicology testing, drug development and disease diagnosis through gene expression profiling. 
These uses are explained in detail in the accompanying Bedilion Declaration, the substance of 
which is not rebutted by the Patent Examiner. There is no dispute that the claimed invention is in 
fact a useful tool in cDNA microarrays used to perform gene expression analysis. That is 
sufficient to establish utility for the claimed polynucleotide. 

In his Declaration, Dr. Bedilion explains the many reasons why a person skilled in the art 
reading the Hodgson "416 priority application, the Hodgson '542 application, on August 5, i999 
would have understood that application to disclose the claimed polynucleotide to be useful for a 
number of gene expression monitoring applications, e.g., as a highly specific probe for the 
expression of that specific polynucleotide in connection with the development of drugs and the 
monitoring of the activity of such drugs. (Bedilion Declaration at, e.g., ffl 10-15). Much, but not 
all, of Dr. Bedilion' s explanation concerns the use of the claimed polynucleotide in cDNA 
microarrays of the type first developed at Stanford University for evaluating the efficacy and 
toxicity of drugs, as well as for other applications. (Bedilion Declaration, fl 12 and 15). 2 

In connection with his explanations, Dr. Bedilion states that the "Hodgson '416 
specification would have led a person skilled in the art on August 5, 1999 who was using gene 
expression monitoring in connection with working on developing new drugs for the treatment of 
autoimmune/inflammatory disorders and cell proliferative disorders, including cancer, [a] to 
conclude that a cDNA microarray that contained the SEQ ID NO:4 polynucleotides would be a 
highly useful tool, and [b] to request specifically that any cDNA microarray that was being used 
for such purposes contain the SEQ ID NO:4 polynucleotides" (Bedilion Declaration, f 15 ). For 
example, as explained by Dr. Bedilion, "[p]ersons skilled in the art would [have appreciated on 
August 5, 1999] that a cDNA microarray that contained the SEQ ID NO:4 polynucleotides would 
be a more useful tool than a cDNA microarray that did not contain the polynucleotides in 
connection with conducting gene expression monitoring studies on proposed (or actual) drugs for 



2 Dr. Bedilion also explained, for example, why persons skilled in the art would also 
appreciate, based on the Hodgson '416 specification, that the claimed polynucleotide would be 
useful in connection with developing new drugs using technology, such as Northern analysis, that 
predated by many years the development of the cDNA technology (Bedilion Declaration, 1 16). 
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treating autoimmune/inflammatory disorders and cell proliferative disorders, including cancer, 

for such purposes as evaluating their efficacy and toxicity." Id. 

In support of those statements, Dr. Bedilion provided detailed explanations of how cDNA 
technology can be used to conduct gene expression monitoring evaluations, with extensive 
citations to pre-August 5, 1999 publications showing the state of the art on August 5, 1999. 
(Bedilion Declaration, f % 10-14). While Dr. Bedilion's explanations in paragraph 15 of his 
Declaration include almost three pages of text and five subparts (a)-(e), he specifically states that 
his explanations are not "all-inclusive." Id. For example, with respect to toxicity evaluations, 
Dr. Bedilion had earlier explained how persons skilled in the art who were working on drug 
development on August 5, 1999 (and for several years prior to August 5, 1999) "without any 
doubt" appreciated that the toxicity (or lack of toxicity) of any proposed drug was "one of the 
most important criteria to be evaluated in connection with the development of the drug" and how 
the teachings of the Hodgson '416 application clearly include using differential gene expression 
analyses in toxicity studies (Bedilion Declaration, f 10). 

Thus, the Bedilion Declaration establishes thai persons skilled in the art leading the 
Hodgson '416 application at the time it was filed "would have wanted their cDNA microarray to 
have a [SEQ ID NO:4 polynucleotide probe] because a microarray that contained such a probe 
(as compared to one that did not) would provide more useful results in the kind of gene 
expression monitoring studies using cDNA microarrays that persons skilled in the art have been 
doing since well prior to August 5, 1999" (Bedilion Declaration, f 15, item (e)). This, by itself, 
provides more than sufficient reason to compel the conclusion that the Hodgson '416 application 
disclosed to persons skilled in the art at the time of its filing substantial, specific and credible 
real-world utilities for the claimed polynucleotide. 

Nowhere does the Patent Examiner address the fact that, as described on pp. 46-48 of the 
Hodgson '416 application, the claimed polynucleotides can be used as highly specific probes in, 
for example, cDNA microarrays - probes that without question can be used to measure both the 
existence and amount of complementary RNA sequences known to be the expression products of 
the claimed polynucleotides. The claimed invention is not, in that regard, some random sequence 
whose value as a probe is speculative or would require further research to determine. 

Given the fact that the claimed polynucleotide is known to be expressed, its utility as a 
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measuring and analyzing instrument for expression levels is as indisputable as a scale's utility for 
measuring weight. This use as a measuring tool, regardless of how the expression level data 
ultimately would be used by a person of ordinary skill in the art, by itself demonstrates that the 
claimed invention provides an identifiable, real-world benefit that meets the utility requirement. 
Raytheon v. Roper, 724 R2d 951, (Fed. Cir. 1983) (claimed invention need only meet one of its 
stated objectives to be useful); In re Cortwright, 165 F.3d 1353, 1359 (Fed. Cir. 1999) (how the 
invention works is irrelevant to utility); MPEP § 2107 ("Many research tools such as gas 
chromatographs, screening assays, and nucleotide sequencing techniques have a clear, specific, 
and unquestionable utility (e.g., thev are useful in analyzing compounds )'* (emphasis added)). 

Though Applicants need not so prove to demonstrate utility, there can be no reasonable 
dispute that persons of ordinary skill in the art have numerous uses for information about relative 
gene expression including, for example, understanding the effects of a potential drug for treating 
autoimmune/inflammatory disorders and cell proliferative disorders, including cancer. Because 
the patent application states explicitly that the claimed polynucleotide expresses a protein that is 
a member of the Impact family known lu be associated with diseases such as 
autoimmune/inflammatory disorders and cell proliferative disorders, including cancer, there can 
be no reasonable dispute that a person of ordinary skill in the art could put the claimed invention 
to such use. In other words, the person of ordinary skill in the art can derive more information 
about a potential autoimmune/inflammatory disorders and cell proliferative disorders, including 
cancer, drug candidate or potential toxin with the claimed invention than without it (see Bedilion 
Declaration at, e.g., f 15, subpart (e)). 

The Bedilion Declaration shows that a number of pre-August 5, 1999 publications 
confirm and further establish the utility of cDNA microarrays in a wide range of drug 
development gene expression monitoring applications at the time the Hodgson '416 application 
was filed (Bedilion Declaration H 10-14; Bedilion Exhibits A-G). Indeed, Brown and Shalon 
U.S. Patent No. 5,807,522 (the Brown '522 patent, Bedilion Exhibit D), which issued from a 
patent application filed in June 1995 and was effectively published on December 29, 1995 as a 
result of the publication of a PCT counterpart application, shows that the Patent Office 
recognizes the patentable utility of the cDNA technology developed in the early to mid-1990s. 
As explained by Dr. Bedilion, among other things (Bedilion Declaration, % 12): 
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The Brown '522 patent further teaches that the "[m]icroarrays of immobilized 
nucleic acid sequences prepared in accordance with the invention" can be used in 
"numerous" genetic applications, including "monitoring of gene expression" 
applications (see Bedilion Tab D at col. 14, lines 36-42). The Brown '522 patent 
teaches (a) monitoring gene expression (i) in different tissue types, (ii) in different 
disease states, and (iii) in response to different drugs, and (b) that arrays disclosed 
therein may be used in toxicology studies (see Bedilion Tab D at col. 15, lines 13- 
18 and 52-58 and col. 18, lines 25-30). 

Literature reviews published shortly before the filing of the Hodgson '416 application 

describing the state of the art further confirm the claimed invention's utility. Rockett et al. 

confirm, for example, that the claimed invention is useful for differential expression analysis 

regardless of how expression is regulated: 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular 
analysis, recognition of the importance of differential gene expression and 
characterization of differentially expressed genes has existed for many years. 

* * * 

Although differential expression technologies are applicable to a broad range of 
models, perhaps their most important advantage is that, in most cases, absolutely 
no prior knowledge of the specific genes which are up- or down-regulated is 
required. 

Whereas it would be informative to know the identity and functionality of all 
genes up/down regulated by . . . toxicants, this would appear a longer term goal 
.... However, the current use of gene profiling yields a pattern of gene changes 
for a xenobiotic of unknown toxicity which may be matched to that of well 
characterized toxins, thus alerting the toxicologist to possible in vivo similarities 
between the unknown and the standard, thereby providing a platform for more 
extensive toxicological examination, (emphasis added) 

Rockett et al., Differential gene expression in drug metabolism and toxicology: practicalities, 
problems and potential , 29 Xenobiotica No. 7, 655 (1999). 

In another pre-August 5, 1999 article, Lashkari et al. state explicitly that sequences that 
are merely "predicted" to be expressed (predicted Open Reading Frames, or ORFs) - the claimed 
invention in fact is known to be expressed - have numerous uses: 
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Efforts have been directed toward the amplification of each predicted ORF or any 
other region of the genome ranging from a few base pairs to several kilobase 
pairs. There are many uses for these amplicons- they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into other specialized 
vectors such as those used for two-hybrid analysis. The amplicons can also be 
used directly by, for example, arraying onto glass for expression analysis , for 
DNA binding assays, or for any direct DNA assay. 

Lashkari et al., Whole genome analysis: Experimental access to all genome sequenced segments 
through larger-scale efficient oligonucleotide synthesis and PCR , 94 Proc. Nat. Acad. Sci. 8945 
(Aug. 1997) (emphasis added). 

B. The use of nucleic acids coding for proteins expressed by humans as tools for 
toxicology testing, drug discovery, and the diagnosis of disease is now "well- 
established" 

The technologies made possible by expression profiling and the DNA tools upon which 
they rely are now well-established. The technical literature recognizes not only the prevalence of 
these technologies, but also their unprecedented advantages in drug development, testing and 
safety assessment. These technologies include toxicology testing, as described by Bedilion in his 
Declaration. 

Toxicology testing is now standard practice in the pharmaceutical industry. See, e.g., 

John C. Rockett et al., supra: 

Knowledge of toxin-dependent regulation in target tissues is not solely an academic 
pursuit as much interest has been generated in the pharmaceutical industry to harness this 
technology in the early identification of toxic drug candidates, thereby shortening the 
developmental process and contributing substantially to the safety assessment of new 
drugs. 

To the same effect are several other scientific publications, including Emile F. Nuwaysir et al., 
Microarravs and Toxicology: The Advent of Toxicogenomics , 24 Molecular Carcinogenesis 153 
(1999); Sandra Steiner and N. Leigh Anderson, Expression profiling in toxicology — potentials 
and limitations , 112-13 Toxicology Letters 467 (2000). 

Nucleic acids useful for measuring the expression of whole classes of genes are routinely 
incorporated for use in toxicology testing. Nuwaysir et al. describes, for example, a Human 
ToxChip comprising 2089 human clones, which were selected 

for their well-documented involvement in basic cellular processes as well as their 
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responses to different types of toxic insult. Included on this list are DNA replication and 
repair genes, apoptosis genes, and genes responsive to PAHs and dioxin-like compounds, 
peroxisome proliferators, estrogenic compounds, and oxidant stress. Some of the other 
categories of genes include transcription factors, oncogenes, tumor suppressor genes, 
cyclins, kinases, phosphatases, cell adhesion and motility genes, and homeobox genes. 
Also included in this group are 84 housekeeping genes, whose hybridization intensity is 
averaged and used for signal normalization of the other genes on the chip. 

See also Table 1 of Nuwaysir et al. (listing additional classes of genes deemed to be of special 

interest in making a human toxicology microarray). 

The more genes that are available for use in toxicology testing, the more powerful the 
technique. "Arrays are at their most powerful when they contain the entire genome of the species 
they are being used to study." John C. Rockett and David J. Dix, A pplication of DNA Arrays to 
Toxicology , 107 Environ. Health Perspec.681, No. 8 (1999). Control genes are carefully selected 
for their stability across a large set of array experiments in order to best study the effect of 
toxicological compounds. See attached email from the primary investigator on the Nuwaysir 
paper, Dr. Cynthia Afshari, to an Incyte employee, dated July 3, 2000, as well as the original 
message to which she was responding, indicating that even the expression of carefully selected 
control genes can be altered. Thus, there is no expressed gene which is irrelevant to screening 
for toxicological effects, and all expressed genes have a utility for toxicological screening. 

In fact, the potential benefit to the public, in terms of lives saved and reduced health care 
costs, are enormous. Recent developments provide evidence that the benefits of this information 
are already beginning to manifest themselves. Examples include the following: 

• In 1999, CV Therapeutics, an Incyte collaborator, was able to use Incyte gene 
expression technology, information about the structure of a known transporter 
gene, and chromosomal mapping location, to identify the key gene associated with 
Tangiers disease. This discovery took place over a matter of only a few weeks, 
due to the power of these new genomics technologies. The discovery received an 
award from the American Heart Association as one of the top 10 discoveries 
associated with heart disease research in 1999. 

• In an April 9, 2000, article published by the Bloomberg news service, an Incyte 
customer stated that it had reduced the time associated with target discovery and 
validation from 36 months to 18 months, through use of Incyte' s genomic 
information database. Other Incyte customers have privately reported similar 
experiences. The implications of this significant saving of time and expense for 
the number of drugs that may be developed and their cost are obvious. 
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In a February 10, 2000, article in the Wall Street Journal one Incyte customer 
stated that over 50 percent of the drug targets in its current pipeline were derived 
from the Incyte database. Other Incyte customers have privately reported similar 
experiences. By doubling the number of targets available to pharmaceutical 
researchers, Incyte genomic information has demonstrably accelerated the 
development of new drugs. 

Because the Patent Examiner failed to address or consider the "well-established" utilities 
for the claimed invention in toxicology testing, drug development, and the diagnosis of disease, 
the Examiner's rejections should be withdrawn regardless of their merit. 

C. The Uncontested Fact That the Claimed Polynucleotide Encodes for a 
Protein in the Impact Family Also Demonstrates Utility 

In addition to having substantial, specific and credible utilities in numerous gene 
expression monitoring applications, it is undisputed that the claimed polynucleotide SEQ ID 
NO:4 encodes for a protein and Applicants have demonstrated that SEQ ID NO:4 encoding 
polypeptide is a ineiiiuci ui uic imparl liumiy, anu uiat tn^ xinpaw itumi; v/i piw^n^ 
in imprinting. 

The Patent Examiner does not dispute any of the facts set forth in the previous paragraph. 
Neither does the Patent Examiner dispute that, if a polynucleotide encodes for a protein that has a 
substantial, specific and credible utility, then it follows that the polynucleotide also has a 
substantial, specific and credible utility. 

The Examiner must accept the applicant's demonstration that the polypeptide encoded by 
the claimed invention is a member of the Impact family and that utility is proven by a reasonable 
probability unless the Examiner can demonstrate through evidence or sound scientific reasoning 
that a person of ordinary skill in the art would doubt utility. See In re hanger, 503 F.2d 1380, 
1391-92, 183 USPQ 288 (CCPA 1974). The Examiner has not provided sufficient evidence or 
sound scientific reasoning to the contrary. 

Nor has the Examiner provided any evidence that any member of the Impact family, let 
alone a substantial number of those members, is not useful. In such circumstances, the only 
reasonable inference is that the polypeptide encoded by the claimed invention must be useful, 
like the other members of the Impact family. 
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D. Objective evidence corroborates the utilities of the claimed invention 

There is, in fact, no restriction on the kinds of evidence a Patent Examiner may consider 
in determining whether a "real-world" utility exists. Indeed, "real-world" evidence, such as 
evidence showing actual use or commercial success of the invention, can demonstrate conclusive 
proof of utility. Raytheon v. Roper, 220 USPQ2d 592 (Fed. Cir. 1983); Nestle v. Eugene, 55 
F.2d 854, 856, 12 USPQ 335 (6th Cir. 1932). Indeed, proof that the invention is made, used or 
sold by any person or entity other than the patentee is conclusive proof of utility. United States 
Steel Corp. v. Phillips Petroleum Co., 865 F.2d 1247, 1252, 9 USPQ2d 1461 (Fed. Cir. 1989). 

Over the past several years, a vibrant market has developed for databases containing all 
expressed genes (along with the polypeptide translations of those genes), in particular genes 
having medical and pharmaceutical significance such as the instant sequence. (Note that the 
value in these databases is enhanced by their completeness, but each sequence in them is 
independently valuable.) The databases sold by Applicants' assignee, Incyte, include exactly the 
kinds of information made possible by the claimed invention, such as tissue and disease 
associations. Incyte sells its database containing the claimed sequence and millions of other 
sequences throughout the scientific community, including to pharmaceutical companies who use 
the information to develop new pharmaceuticals. 

Both Incyte' s customers and the scientific community have acknowledged that Incyte' s 
databases have proven to be valuable in, for example, the identification and development of drug 
candidates. As Incyte adds information to its databases, including the information that can be 
generated only as a result of Incyte' s discovery of the claimed polynucleotide and its use of that 
polynucleotide on cDNA microarrays, the databases become even more powerful tools. Thus the 
claimed invention adds more than incremental benefit to the drug discovery and development 
process. 

III. The Patent Examiner's Rejections Are Without Merit 

Rather than responding to the evidence demonstrating utility, the Examiner attempts to 
dismiss it altogether by arguing that the disclosed and well-established utilities for the claimed 
polynucleotide are not "specific" utilities. (Office Action at p. 4). The Examiner is incorrect both 

H3111 20 10/009,416 



Docket No.: PT-1042USN 

as a matter of law and as a matter of fact. 

A. The Precise Biological Role Or Function Of An Expressed Polynucleotide Is 
Not Required To Demonstrate Utility 

The Patent Examiner's primary rejection of the claimed invention is based on the ground 
that, without information as to the precise "biological role" of the claimed invention, the claimed 
invention's utility is not sufficiently specific. According to the Examiner, it is not enough that a 
person of ordinary skill in the art could use and, in fact, would want to use the claimed invention 
either by itself or in a cDNA microarray to monitor the expression of genes for such applications 
as the evaluation of a drug's efficacy and toxicity. The Examiner would require, in addition, that 
the applicant provide a specific and substantial interpretation of the results generated in any given 
expression analysis. 

It may be that specific and substantial interpretations and detailed information on 
biological function are necessary to satisfy the requirements for publication in some technical 
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patent. The relevant question is not, as the Examiner would have it, whether it is known how or 
why the invention works, In re Cortwright, 165 F.3d 1353, 1359 (Fed. Cir. 1999), but rather 
whether the invention provides an "identifiable benefit" in presently available form. Juicy Whip 
Inc. v. Orange Bang Inc., 185 F.3d 1364, 1366 (Fed. Cir. 1999). If the benefit exists, and there is 
a substantial likelihood the invention provides the benefit, it is useful. There can be no doubt, 
particularly in view of the Bedilion Declaration (at, e.g., f[ 10 and 15, Bedilion), that the present 
invention meets this test. 

The threshold for determining whether an invention produces an identifiable benefit is 
low. Juicy Whip, 185 F.3d at 1366. Only those utilities that are so nebulous that a person of 
ordinary skill in the art would not know how to achieve an identifiable benefit and, at least 
according to the PTO guidelines, so-called "throwaway" utilities that are not directed to a person 
of ordinary skill in the art at all, do not meet the statutory requirement of utility. Utility 
Examination Guidelines, 66 Fed. Reg. 1092 (Jan. 5, 2001). 

Knowledge of the biological function or role of a biological molecule has never been 
required to show real-world benefit. In its most recent explanation of its own utility guidelines, 
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the PTO acknowledged so much (66 F.R. at 1095): 

[T]he utility of a claimed DNA does not necessarily depend on the function of the 
encoded gene product. A claimed DNA may have specific and substantial utility 
because, e.g., it hybridizes near a disease-associated gene or it has gene-regulating 
activity. 

By implicitly requiring knowledge of biological function for any claimed nucleic acid, the 
Examiner has, contrary to law, elevated what is at most an evidentiary factor into an absolute 
requirement of utility. Rather than looking to the biological role or function of the claimed 
invention, the Examiner should have looked first to the benefits it is alleged to provide. 

B. Membership in a Class of Useful Products Can Be Proof of Utility 

Despite the uncontradicted evidence that the claimed polynucleotide encodes a 
polypeptide in the Impact family, the Examiner refused to impute the utility of the members of 
the Impact family to SEQ ID NO:4. In the Office Action, the Patent Examiner takes the position 
that, unless Applicants can identify which particular disease within the class of Impacts is 
possessed by SEQ ID NO:4, utility cannot be imputed. To demonstrate utility by membership in 
the class of Impacts, the Examiner would require that all Impacts possess a "common" utility. 

There is no such requirement in the law. In order to demonstrate utility by membership in 
a class, the law requires only that the class not contain a substantial number of useless members. 
So long as the class does not contain a substantial number of useless members, there is sufficient 
likelihood that the claimed invention will have utility, and a rejection under 35 U.S.C. § 101 is 
improper. That is true regardless of how the claimed invention ultimately is used and whether or 
not the members of the class possess one utility or many. See Brenner v. Manson, 383 U.S. 519, 
532 (1966); Application of Kirk, 376 F.2d 936, 943 (CCPA 1967). 

Membership in a "general" class is insufficient to demonstrate utility only if the class 
contains a sufficient number of useless members such that a person of ordinary skill in the art 
could not impute utility by a substantial likelihood. There would be, in that case, a substantial 
likelihood that the claimed invention is one of the useless members of the class. In the few cases 
in which class membership did not prove utility by substantial likelihood, the classes did in fact 
include predominately useless members. E.g., Brenner (man-made steroids); Kirk (same); Natta 
(man-made polyethylene polymers). 
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The Examiner addresses SEQ ID NO:4 as if the general class in which it is included is 

not the Impact family, but rather all polynucleotides or all polypeptides, including the vast 

majority of useless theoretical molecules not occurring in nature, and thus not pre-selected by 

nature to be useful. While these "general classes" may contain a substantial number of useless 

members, the Impact family does not. The Impact family is sufficiently specific to rule out any 

reasonable possibility that SEQ ID NO:4 would not also be useful like the other members of the 

family. 

Because the Examiner has not presented any evidence that the Impact class of proteins 
has any, let alone a substantial number, of useless members, the Examiner must conclude that 
there is a "substantial likelihood" that the polypeptides encoded by the claimed polynucleotides 
are useful. It follows that the claimed polynucleotides also are useful. 

It is undisputed that known members of the Impact family are proteins involved in 
genomic imprinting. A person of ordinary skill in the art need not know any more about how the 
claimed invention functions to use it, and the Examiner presents no evidence to the contrary. 

As demonstrated by Appiicants, knowledge thai SEQ ID NO:4 is an Impact protein is 
more than sufficient to make it useful for the diagnosis and treatment of 
autoimmune/inflammatory disorders and cell proliferative disorders, including cancer. The 
Examiner must accept these facts to be true unless the Examiner can provide evidence or sound 
scientific reasoning to the contrary. But the Examiner has not done so. 

C. Because the uses of polynucleotides of SEQ ID NO:4 in toxicology testing, 
drug discovery, and disease diagnosis are practical uses beyond mere study 
of the invention itself, the claimed invention has substantial utility. 

The PTO rejected the claims at issue on the ground that the use of an invention as a tool 
for research is not a "substantial" use. Because the PTO's rejection assumes a substantial 
overstatement of the law, and is incorrect in fact, it must be withdrawn. 

There is no authority for the proposition that use as a tool for research is not a substantial 

utility. Indeed, the Patent Office has recognized that just because an invention is used in a 

research setting does not mean that it lacks utility (MPEP § 2107): 

Many research tools such as gas chromatographs, screening assays, and nucleotide 
sequencing techniques have a clear, specific and unquestionable utility (e.g., they are 
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useful in analyzing compounds). An assessment that focuses on whether an invention is 
useful only in a research setting thus does not address whether the specific invention is in 
fact "useful" in a patent sense. Instead, Office personnel must distinguish between 
inventions that have a specifically identified utility and inventions whose specific utility 
requires further research to identify or reasonably confirm. 

The Patent Office's actual practice has been, at least until the present, consistent with that 
approach. It has routinely issued patents for inventions whose only use is to facilitate research, 
such as DNA ligases. These are acknowledged by the PTO's Training Materials themselves to 
be useful, as well as DNA sequences used, for example, as markers. 

Only a limited subset of research uses are not "substantial" utilities: those in which the 
only known use for the claimed invention is to be an object of further study, thus merely inviting 
further research. This follows from Brenner, in which the U.S. Supreme Court held that a 
process for making a compound does not confer a substantial benefit where the only known use 
of the compound was to be the object of further research to determine its use. Id. at 535. 
Similarly, in Kirk, the Court held that a compound would not confer substantial benefit on the 
public merely because it might be used to synthesize some other, unknown compound that would 
confer substantial benefit. Kirk, 376 F.2d at 940, 945 ("What Applicants are really saying to 
those in the art is take these steroids, experiment, and find what use they do have as medicines"). 
Nowhere do those cases state or imply, however, that a material cannot be patentable if it has 
some other beneficial use in research. 

As used in toxicology testing, drug discovery, and disease diagnosis, the claimed 
invention has a beneficial use in research other than studying the claimed invention or its protein 
products. It is a tool, rather than an object, of research. The data generated in gene expression 
monitoring using the claimed invention as a tool is not used merely to study the claimed 
polynucleotide itself, but rather to study properties of tissues, cells, and potential drug candidates 
and toxins. Without the claimed invention, the information regarding the properties of tissues, 
cells, drug candidates and toxins is less complete. (Bedilion Declaration at f 15.) 

The claimed invention has numerous additional uses as a research tool, each of which 
alone is a "substantial utility." These include uses such as diagnostic assays (e.g., pages 36-39), 
chromosomal markers (e.g., pages 39-40), and ligand screening assays (e.g., page 40). 
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IV. By Requiring the Patent Applicant to Assert a Particular or Unique Utility, the 
Patent Examination Utility Guidelines and Training Materials Applied by the 
Patent Examiner Misstate the Law 



There is an additional, independent reason to withdraw the rejections: to the extent the 
rejections are based on Revised Interim Utility Examination Guidelines (64 FR 71427, 
December 21, 1999), the final Utility Examination Guidelines (66 FR 1092, January 5, 2001) 
and/or the Revised Interim Utility Guidelines Training Materials (USPTO Website 
www.uspto.gov, March 1, 2000), the Guidelines and Training Materials are themselves 
inconsistent with the law. 

The Training Materials, which direct the Examiners regarding how to apply the Utility 
Guidelines, address the issue of specificity with reference to two kinds of asserted utilities: 
"specific" utilities which meet the statutory requirements, and "general" utilities which do not. 
The Training Materials define a "specific utility" as follows: 

A [specific utility] is specific to the subject matter claimed. This contrasts to general 

Utility L11CIL WUU1U UV/ appi-lt/ClL/ll/ IV/ U1U U1V/UU V/x imwuuwii. jl \sx v/iiuiij/iv, ** viwim w — 

polynucleotide whose use is disclosed simply as "gene probe" or "chromosome marker" 
would not be considered to be specific in the absence of a disclosure of a specific DNA 
target. Similarly, a general statement of diagnostic utility, such as diagnosing an 
unspecified disease, would ordinarily be insufficient absent a disclosure of what condition 
can be diagnosed. 

The Training Materials distinguish between "specific" and "general" utilities by assessing 
whether the asserted utility is sufficiently "particular," Le. 9 unique (Training Materials at p.52) as 
compared to the "broad class of invention." (In this regard, the Training Materials appear to 
parallel the view set forth in Stephen G. Kunin, Written Description Guidelines and Utility 
Guidelines . 82 J.P.T.O.S. 77, 97 (Feb. 2000) ("With regard to the issue of specific utility the 
question to ask is whether or not a utility set forth in the specification is particular to the claimed 
invention.")). 

Such "unique" or "particular" utilities never have been required by the law. To meet the 
utility requirement, the invention need only be "practically useful," Natta, 480 F.2d 1 at 1397, 
and confer a "specific benefit" on the public. Brenner, 383 U.S. at 534. Thus, incredible "throw- 
away" utilities, such as trying to "patent a transgenic mouse by saying it makes great snake food," 
do not meet this standard. Karen Hall, Genomic Warfare , The American Lawyer 68 (June 2000) 
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(quoting John Doll, Chief of the Biotech Section of USPTO). 

This does not preclude, however, a general utility, contrary to the statement in the 
Training Materials where "specific utility" is defined (page 5). Practical real-world uses are not 
limited to uses that are unique to an invention. The law requires that the practical utility be 
"definite," not particular. 

Montedison, 664 F.2d at 375. Applicant is not aware of any court that has rejected an assertion 
of utility on the grounds that it is not "particular" or "unique" to the specific invention. Where 
courts have found utility to be too "general," it has been in those cases in which the asserted 
utility in the patent disclosure was not a practical use that conferred a specific benefit. That is, a 
person of ordinary skill in the art would have been left to guess as to how to benefit at all from 
the invention. In Kirk, for example, the CCPA held the assertion that a man-made steroid had 
"useful biological activity" was insufficient where there was no information in the specification 
as to how that biological activity could be practically used. Kirk, 376 F.2d at 941. 

The fact that an invention can have a particular use does not provide a basis for requiring 
a particular use. oee? Drunu, supru ^ui&ciuauic uoaunun^ « tiaiuiwu wnnwinui ww**^ vr**«i** 
being homologous to an antitumor compound having activity against a "particular" type of cancer 
was determined to satisfy the specificity requirement). "Particularity" is not and never has been 
the sine qua non of utility; it is, at most, one of many factors to be considered. 

As described supra, broad classes of inventions can satisfy the utility requirement so long 
as a person of ordinary skill in the art would understand how to achieve a practical benefit from 
knowledge of the class. Only classes that encompass a significant portion of nonuseful members 
would fail to meet the utility requirement. Supra § DLB.2 [Montedison, 664 F.2d at 374-75). 

The Training Materials fail to distinguish between broad classes that convey information 
of practical utility and those that do not, lumping all of them into the latter, unpatentable category 
of "general" utilities. As a result, the Training Materials paint with too broad a brush. 
Rigorously applied, they would render unpatentable whole categories of inventions that 
heretofore have been considered to be patentable and that have indisputably benefitted the public, 
including the claimed invention. See supra § H.B. Thus the Training Materials cannot be 
applied consistently with the law. 
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For at least the above reasons, withdrawal of this rejection is respectfully requested. 



Enablement rejections under 35 U.S.C. § 112. first paragraph 

Claims 1, 4-6, 9-10, 12, 16-17, 58-63 were rejected under 35 U.S.C. § 112, first 
paragraph. The rejection set forth in the Office Action is based on the assertions discussed 
above, i.e., that the claimed invention lacks patentable utility. To the extent that the rejection 
under § 1 12, first paragraph, is based on the improper allegation of lack of patentable utility 
under § 101, it fails for the same reasons. 

Claims 1, 4-6, 9-10, 12, 16-17, 58-63 were also rejected under 35 U.S.C. § 112, first 
paragraph because the specification disclosure is insufficient to enable one skilled in the art to 
practice the invention as broadly claimed without an undue amount of experimentation. 

Applicants point out that while nucleotides which "comprise at least 30 contiguous 
nucleotides" might conceivably detect both bipolar and prostate sequences in addition to the 
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and SEQ ID NO:4 by sequencing the 3 molecules. Sequencing techniques are routine in the art. 
In fact, it is a routine procedure for one skilled in the art to sequence the target sequence for 
confirmation. Moreover, the claims contain recitations such as "detecting a target 
polynucleotide" or formation of "a specific hybridization complex." By use of appropriate 
conditions would one be able to perform the claimed methods so as to detect the polynucleotides 
recited by the claims. Determination of such conditions is also routine in the art. 

For at least the above reasons, withdrawal of the enablement rejections of claims 1, 4-6, 
9-10, 12, 16-17, 58-63 are respectfully requested. 

Written description rejection under 35 U.S.C. § 1 12, first paragraph 

Claims 1, 4-6, 9-10, 12, 16-17, 58-63 were rejected under the written description 35 
U.S.C. § 1 12, first paragraph, as allegedly containing subject matter which was not described in 
such a way as to reasonably convey to one skilled in the relevant art that the inventor(s), as the 
time the application was filed, had possession of the claimed invention. The Examiner states that 
the specification provides insufficient written description to support the genus encompassed by 
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the claim. Applicants respectfully traverse the rejection for at least the following reasons. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. § 
1 12, first paragraph, are well established by case law. 

... the applicant must also convey with reasonable clarity to those skilled in the 
art that, as of the filing date sought, he or she was in possession of the invention. 
The invention is, for purposes of the "written description" inquiry, whatever is 
now claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 
1991) [emphasis added] 

. Mention of representative compounds encompassed by generic claim 
language clearly is not required by Section 112 or any other provision of the 
statute. But, where no explicit description of a generic invention is to be found in 
the specification... mention of representative compounds may provide an implicit 
description upon which to base generic claim language. In re Robins, 429 F.2d 
452, 456-57, 166 USPQ 552, 555 (CCPA 1970) [emphasis added] 

. . . [I]t has been consistently held that the naming of one member of such 
a group is not, in itself, a proper basis for a claim to the entire group. However, it 
may not oe necessary iu enumeruiv u pnuiutuy uj ^ct>c^ §j ** 
sufficiently identified in an application by 'other appropriate language. ' In re 
Grimme, 214 F.2d 949, 952, 124 USPQ 499, 501 (CCPA 1960) [emphasis added] 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for 
Examination of Patent Applications Under the 35 U.S.C. Sec. 1 12, para. 1", published January 5, 
2001, which provide that: 

An applicant may also show that an invention is complete by disclosure of 
sufficiently detailed, relevant identifying characteristics which provide evidence 
that applicant was in possession of the claimed invention, i.e., complete or partial 
structure, other physical and/or chemical properties, functional characteristics 
when coupled with a known or disclosed correlation between function and 
structure, or some combination of such characteristics. What is conventional or 
well known to one of ordinary skill in the art need not be disclosed in detail. If a 
skilled artisan would have understood the inventor to be in possession of the 
claimed invention at the time of filing, even if every nuance of the claims is not 
explicitly described in the specification, then the adequate description requirement 
is met. [footnotes omitted] 

Thus, the written description standard is fulfilled by both what is specifically disclosed 
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and what is conventional or well known to one skilled in the art. 

The Specification provides an adequate written description of the claimed "variants" of 
SEQIDNO:4. 

SEQ ID NO:4 is specifically disclosed in the application (see, for example, the Sequence 
Listing). Variants of SEQ ID NO:4 having 90% sequence identity to SEQ ID NO:4 are 
described, for example, at page 17, line 23. Incyte clones and shotgun sequences in which the 
nucleic acids encoding SEQ ID NO:4 are described, for example, at Table 4 of the specification. 
Chemical and structural features of SEQ ID NO:4 are described, for example, at Table 2 of the 
specification. Given SEQ ID NO:4, one of ordinary skill in the art would recognize naturally- 
occurring variants of SEQ ID NO:4 having 90% sequence identity to SEQ ID NO:4. 

In addition, the specification discloses examples of naturally occurring polynucleotide 
variants including allelic variants (page 6, lines 29-34), splice variants, species variants, or 
polymorphic variants, such as single nucleotide polymorphisms (SNPs) (page 17, lines 33-35 to 
page 18, iine 1). The specification discloses how to calculate the % identity between two 
sequences (see the specification at page 44, lines 28 through page 45, line 13), allowing one of 
skill in the art to determine which naturally occurring sequences are encompassed by the claims. 
Accordingly, the specification provides an adequate written description of the recited variant 
polynucleotide sequences. 

A. The present claims specifically define the claimed genus through the 
recitation of chemical structure 

Court cases in which "DNA claims" have been at issue commonly emphasize that the 

recitation of structural features or chemical or physical properties are important factors to 

consider in a written description analysis of such claims. For example, in Fiers v. Revel, 25 

USPQ2d 1601, 1606 (Fed. Cir. 1993), the court stated that: 

If a conception of a DNA requires a precise definition, such as by structure, 
formula, chemical name or physical properties, as we have held, then a description 
also requires that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts 
have noted that the claims attempted to define the claimed DNA in terms of functional 
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characteristics without any reference to structural features. As set forth by the court in University 

of California v. Eli Lilly and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997): 

In claims to genetic material, however, a generic statement such as "vertebrate 
insulin cDNA" or "mammalian insulin cDNA," without more, is not an adequate 
written description of the genus because it does not distinguish the claimed genus 
from others, except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 

structural features, has been a common basis by which courts have found invalid claims to DNA. 

For example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written 

description requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within its 
nucleotide sequence a subsequence having the structure of the reverse transcript of 
an mRNA of a vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following 

count: 
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interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an 
adequate written description of the DNA of the count because that application mentioned a 
potential method for isolating the DNA. The Revel priority application, however, did not have a 
description of any particular DNA structure corresponding to the DNA of the count. The court 
therefore found that the Revel priority application lacked an adequate written description of the 
subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional 

characteristics and were found not to comply with the written description requirement of 35 

U.S.C. §112; Le., "an mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA 

which codes for a human fibroblast interferon-beta polypeptide" in Fiers. In contrast to the 

situation in Lilly and Fiers, the claims at issue in the present application define polynucleotides 

in terms of chemical structure, rather than on functional characteristics. The "variant language" 

of independent claim 1 recites chemical structure to define the claimed genus: 

1. An isolated polynucleotide comprising a polynucleotide sequence selected from 
the group consisting of ... b) a naturally occurring polynucleotide sequence 
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having at least 90% sequence identity to a polynucleotide sequence selected from 
the group consisting of SEQ ID NO: 1-14, 

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the 
present claims is defined in terms of the chemical structure of SEQ ID NO:4. There is no 
recitation of the functional characteristics of the claimed polynucleotides. The polynucleotides 
defined in the claims of the present application recite structural features, and cases such as Lilly 
and Fiers stress that the recitation of structure is an important factor to consider in a written 
description analysis of claims to nucleic acids. By failing to base its written description inquiry 
"on whatever is now claimed," the Office Action failed to provide an appropriate analysis of the 
present claims and how they differ from those found not to satisfy the written description 
requirement in Lilly and Fiers 

B. The present claims do not define a genus which is "highly variant" 

Furthermore, the claims at issue do not describe a genus which could be characterized as 
"highly variant." Available evidence illustrates that the claimed genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference 
(Reference No. 5) by Brenner et al. Through exhaustive analysis of a data set of proteins with 
known structural and functional relationships and with <90% overall sequence identity, Brenner 
et al. have determined that 30% identity is a reliable threshold for establishing evolutionary 
homology between two amino acid sequences aligned over at least 150 residues. (Brenner et al., 
pages 6073 and 6076.) Furthermore, local identity is particularly important in this case for 
assessing the significance of the alignments, as Brenner et al. further report that ^40% identity 
over at least 70 residues is reliable in signifying homology between proteins. (Brenner et al., 
page 6076.) 

The present application is directed, inter alia, to nucleic acid sequences comprising 
polynucleotides associated with disease detection and treatment molecules (mddt). In 
accordance with Brenner et al, naturally occurring molecules may exist which could be 
characterized as mddt and which have only 30% identity over at least 150 residues to the 
polypeptide encoded by SEQ ID NO:4. This variation is far less than that of all potential disease 
detection and treatment molecule proteins related to SEQ ID NO:4. 
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C. The state of the art at the time of the present invention is further advanced 

than at the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to 
comply with the written description requirement of 35 U.S.C §112. The '525 patent claimed the 
benefit of priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and 
Application Serial No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the 
benefit of priority of an Israeli application filed on November 21, 1979. Thus, the written 
description inquiry in those case was based on the state of the art at essentially at the "dark ages" 
of recombinant DNA technology. 

The present application has a priority date of August 5, 1999. Much has happened in the 
development of recombinant DNA technology in the 20 years from the time of filing of the 
applications involved in Lilly and Fiers and the present application. For example, the technique 
of polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA sequencing 
technology has been developed. Large databases of protein and nucleotide sequences have been 
compiled. Much of the raw material of the human and other genomes has been sequenced. With 
these remarkable advances one of skill in the art would recognize that, given the sequence 
information of SEQ ED NO:4, and the additional extensive detail provided by the subject 
application, the present inventors were in possession of the claimed polynucleotide variants at the 
time of filing of this application. 

D. Summary 

The Office Action failed to base its written description inquiry "on whatever is now 
claimed." Consequently, the Action did not provide an appropriate analysis of the present claims 
and how they differ from those found not to satisfy the written description requirement in cases 
such as Lilly and Fiers. In particular, the claims of the subject application are fundamentally 
different from those found invalid in Lilly and Fiers. The subject matter of the present claims is 
defined in terms of the chemical structure of SEQ ID NO:4. The courts have stressed that 
structural features are important factors to consider in a written description analysis of claims to 
nucleic acids. In addition, the genus of DNA defined by the present claims is not "highly 
variant," as evidenced by Brenner et al. Furthermore, there have been remarkable advances in 
the state of the art since the Lilly and Fiers cases, and these advances were given no 
consideration whatsoever in the position set forth by the Office Action. 
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For at least the reasons set forth above, the Specification provides an adequate written 

description of the claimed subject matter, and withdrawal of this rejection is therefore requested. 
Indefinitencss rejection under 35 U.S.C. § 112. second paragraph 

Claims 6 and 17 were rejected under the written description 35 U.S.C. § 112, second 
paragraph, as allegedly being indefinite for failing to particularly point out and distinctly claim 
the subject matter which applicant regards as the invention. 

Applicants point out that claim 6, a method for detecting a target polynucleotide in a 
sample, comprises two steps: first the hybridization step, and secondly the detection of the 
presence or absence of a hybridization complex. The Examiner's attention is respectfully 
directed to the "Hybridization and Genetic Analysis" section in the specification at page 23, lines 
20-35; and to page 24, lines 9-27, wherein the specification describes methods of detecting a 
target polynucleotide in a sample. These passages make clear that "the detection of the presence 
or absence of a hybridization complex" is pail of the method for detecting a target cf 
polynucleotide in a sample. Based upon the disclosure, one of skill in the art would understand 
what is meant by the claim. 

Similarly for claim 17, the Examiner's attention is respectfully directed to the "Transcript 
Imaging" section in the specification at page 30, lines 10-35; and to page 31, lines 1-2, wherein 
the specification describes that quantifying the expression of the polynucleotide in a sample is 
part of the method for generating a transcript image. In other words, the pattern of gene 
expression is defined by, inter alia, the number of expressed genes and their abundance. Based 
upon the disclosure, one of skill in the art would understand what is meant by the claim. Further, 
use of the language "the elements of the microarray" in claim 17 has been clarified. 

For at least the above reasons, withdrawal of the indefiniteness rejection is requested. 
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CONCLUSION 

In light of the above amendments and remarks, Applicants submit that the present 
application is fully in condition for allowance, and request that the Examiner withdraw the 
outstanding objections/rejections. Early notice to that effect is earnestly solicited. 

If the Examiner contemplates other action, or if a telephone conference would expedite 
allowance of the claims, Applicants invite the Examiner to contact the undersigned at the number 
listed below. 

Please charge Deposit Account No. 09-0108 in the amount of $ 410.00 as set forth in 
the enclosed fee transmittal letter. If the USPTO determines that an additional fee is necessary, 
please charge any required fee to Deposit Account No. 09-0108. 



Respectfully submitted, 
INCYTE CORPORATION 

Date: ^ ~> fed - ~^ ~~ z o^~ r z 

Richard C. Ekstrom 
Reg. No. 37,027 

Direct Dial Telephone: (650) 843-7352 



Date: 



Yu-Mei Eureka Wanj 
Reg. No. 50,510 
Direct Dial Telephone: (650) 621-8740 
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IN THE CLAIMS 

This listing of the claims replaces all prior versions of the claims in the application. 

1. (Original) An isolated polynucleotide comprising a polynucleotide sequence selected 
from the group consisting of: 

a) a polynucleotide sequence selected from the group consisting of SEQ ID NO: 1-14, 

b) a naturally occurring polynucleotide sequence having at least 90% sequence identity to 
a polynucleotide sequence selected from the group consisting of SEQ ID NO: 1-14, 

c) a polynucleotide sequence complementary to a), 

d) a polynucleotide sequence complementary to b), and 

e) an RNA equivalent of a) through d). 

2. -3. (Canceled) 

f . 1 filial / /I VWlil^V/dlLlUll LKJl U1W VJ-V LV'V-'LlWli ui vApivoaiun Ul UIOVUOV UVlVWUUn UllU 

treatment molecule polynucleotides comprising at least one of the polynucleotides of claim 1 and 
a detectable label. 

5. (Original) A method for detecting a target polynucleotide in a sample, said target 
polynucleotide having a sequence of a polynucleotide of claim 1, the method comprising: 

a) amplifying said target polynucleotide or fragment thereof using polymerase chain 
reaction amplification, and 

b) detecting the presence or absence of said amplified target polynucleotide or fragment 
thereof, and, optionally, if present, the amount thereof. 

6. (Original) A method for detecting a target polynucleotide in a sample, said target 
polynucleotide comprising a sequence of a polynucleotide of claim 1, the method comprising: 

a) hybridizing the sample with a probe comprising at least 20 contiguous nucleotides 
comprising a sequence complementary to said target polynucleotide in the sample, and which 
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probe specifically hybridizes to said target polynucleotide, under conditions whereby a 

hybridization complex is formed between said probe and said target polynucleotide, and 

b) detecting the presence or absence of said hybridization complex, and, optionally, if 
present, the amount thereof. 

7.-8. (Canceled) 

9. (Original) A recombinant polynucleotide comprising a promoter sequence operably 
linked to a polynucleotide of claim 1. 

10. (Original) A cell transformed with a recombinant polynucleotide of claim 9. 

11. (Canceled) 

12. (Original) A method for producing a disease detection and treatment molecule 
polypeptide, the method comprising: 

a) culturing a cell under conditions suitable for expression of the disease detection and 
treatment molecule polypeptide, wherein said cell is transformed with a recombinant 
polynucleotide of claim 9, and 

b) recovering the disease detection and treatment molecule polypeptide so expressed. 

13. (Currently Amended) A purified disease detection and treatment molecule 
polypeptide (MDDT) selected from the group consisting of: 

a) a polypeptide comprising the polypeptide encoded bv SEP ID NO:4, and 

b) a polypeptide comprising a naturally-occurring amino acid sequence at l east 90% 

identical to the amino acid sequence of the polypeptide encoded b v SEP ID NO:4. 

14. (Canceled) 
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15. (Currently Amended) A method of identifying a test compound which specifically 

binds to the disease detection and treatment molecule polypeptide of claim 13, the method 
comprising the steps of : 

a) providing a test compound; 

b) combining the disease detection and treatment molecule polypeptide with the test 
compound for a sufficient time and under suitable conditions for binding; and 

c) detecting binding of the disease detection and treatment molecule polypeptide to the 
test compound, thereby identifying the test compound which specifically binds the disease 
detection and treatment molecule polypeptide. 

16. (Previously Presented) A microarray wherein at least one element of the microarray 
is a polynucleotide of claim 1. 

17. (Currently Amended) A method for generating a transcript image of a sample which 
contains polynucleotides, the method comprising the steps of:: 

a) labeling the polynucleotides of the sample, 

b) contacting the elements of the microarray of claim 16 with the labeled polynucleotides 
of the sample under conditions suitable for the formation of a hybridization complex, and 

c) quantifying the expression of the polynucleotides in the sample. 

18. (Canceled) 

19. (Original) A method of claim 6 for toxicity testing of a compound, further 
comprising 

(c) comparing the presence, absence or amount of said target polynucleotide in a first 
biological sample and a second biological sample, wherein said first biological sample has been 
contacted with said compound, and said second sample is a control, whereby a change in 
presence, absence or amount of said target polynucleotide in said first sample, as compared with 
said second sample, is indicative of toxic response to said compound. 
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20. (Original) A method for screening a compound for effectiveness in altering 

expression of a target polynucleotide, wherein said target polynucleotide comprises a 
polynucleotide sequence of claim 1, the method comprising: 

a) exposing a sample comprising the target polynucleotide to a compound, under 
conditions suitable for the expression of the target polynucleotide, 

b) detecting altered expression of the target polynucleotide, and 

c) comparing the expression of the target polynucleotide in the presence of varying 
amounts of the compound and in the absence of the compound. 

21. -56. (Canceled) 

57. (Previously Presented) A method for assessing toxicity of a test compound, said 
method comprising: 

a) treating a biological sample containing nucleic acids with the test compound; 

b) hybridizing the nucleic acids of the treated biological sample with a prebe comprising 
at least 20 contiguous nucleotides of a polynucleotide of claim 1 under conditions whereby a 
specific hybridization complex is formed between said probe and a target polynucleotide in the 
biological sample, said target polynucleotide comprising a polynucleotide sequence of a 
polynucleotide of claim 1 or fragment thereof; 

c) quantifying the amount of hybridization complex; and 

d) comparing the amount of hybridization complex in the treated biological sample with 
the amount of hybridization complex in an untreated biological sample, wherein a difference in 
the amount of hybridization complex in the treated biological sample is indicative of toxicity of 
the test compound. 

58. (Previously Presented) An array comprising different nucleotide molecules affixed in 
distinct physical locations on a solid substrate, wherein at least one of said nucleotide molecules 
comprises a first oligonucleotide or polynucleotide sequence specifically hybridizable with at 
least 30 contiguous nucleotides of a target polynucleotide, said target polynucleotide having a 
sequence of claim 1. 
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59. (Previously Presented) An array of claim 58, wherein said first oligonucleotide or 

polynucleotide sequence is completely complementary to at least 30 contiguous nucleotides of 
said target polynucleotide. 

60. (Previously Presented) An array of claim 58, which is a microarray. 

61. (Previously Presented) An array of claim 58, further comprising said target 
polynucleotide hybridized to said first oligonucleotide or polynucleotide. 

62. (Previously Presented) An array of claim 58, wherein a linker joins at least one of 
said nucleotide molecules to said solid substrate. 

63. (Previously Presented) An array of claim 58, wherein each distinct physical location 
on the substrate contains multiple nucleotide molecules having the same sequence, and each 
distinct physical location on the substrate contains nucleotide molecules having a sequence 
which differs from the sequence of nucleotide molecules at another physical location on the 
substrate. 
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1 . An important feature of the work of many molecular biologists is identifying which 
genes are switched on and off in a cell under different environmental conditions or 
subsequent to xenobiotic challenge. Such information has many uses, including the 
deciphering of molecular pathways and facilitating the development of new experimental 
and diagnostic procedures. However, the student of gene hunting should be forgiven for 
perhaps becoming confused by the mountain of information available as there appears to be 
almost as many methods of discovering differentially expressed genes as there are research 
groups using the technique. 

2 . The aim of this review was to clarify the main methods of differential gene expression 
analysis and the mechanistic principles underlying them. Also included is a discussion on 
some of the practical aspects of using this technique. Emphasis is placed on the so-called 
* open ' systems, which require no prior knowledge of the genes contained within the study 
model. Whilst these will eventually be replaced by 'closed ' systems in the study of human, 
mouse and other commonly studied laboratory animals, they will remain a powerful tool for 
those examining less fashionable models. 

3. The use of suppress ion-PCR subtractive hybridization is exemplified in the 
identification of up- and down- regulated genes in rat liver following exposure to pheno- 
barbital, a well-known inducer of the drug metabolizing enzymes. 

4. Differential gene display provides a coherent platform for building libraries and 
microchip arrays of 'gene fingerprints' characteristic of known enzyme inducers and 
xenobiotic toxicants, which may be interrogated subsequently for the identification and 
characterization of xenobiotics of unknown biological properties. 



Introduction 

It is now apparent that the development of almost all cancers and many non- 
neoplastic diseases are accompanied by altered gene expression in the affected cells 
compared to their normal state (Hunter 1991, Wynford-Thomas 1991, Vogelstein 
and Kinzler 1993, Semenza 1994, Cassidy 1995, Kleinjan and Van Hegningen 1998). 
Such changes also occur in response to external stimuli such as pathogenic micro- 
organisms (Rohn et al. 1996, Singh et al. 1997, Griffin and Krishna 1998, Lunney 
1998) and xenobiotics (Sewall et al. 1995, Dogra et al. 1998, Ramana and Kohli 
1998), as well as during the development of undifferentiated cells (Hecht 1998, 
Rudin and Thompson 1998, Schneider-Maunoury et al. 1998). The potential 
medical and therapeutic benefits of understanding the molecular changes which 
occur in any given cell in progressing from the normal to the 'altered* state are 
enormous. Such profiling essentially provides a 'fingerprint* of each step of a 
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celFs development or response and should help in the elucidation of specific and 
sensitive biomarkers representing, for example, different types of cancer or previous 
exposure to certain classes of chemicals that are enzyme inducers. 

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including 
the well-characterized isoforms of cytochrome P450) are inducible by drugs and 
chemicals in man (Pelkonen et al. 1998), predominantly involving transcriptional 
activation of not only the cognate cytochrome P450 genes, but additional cellular 
proteins which may be crucial to the phenomenon of inductioa Accordingly, the 
development of methodology to identify and assess the full complement of genes 
that are either up- or down-regulated by inducers are crucial in the development of 
knowledge to understand the precise molecular mechanisms of enzyme induction 
and how this relates to drug action. Similarly, in the field of chemical-induced 
toxicity, it is now becoming increasingly obvious that most adverse reactions to 
drugs and chemicals are the result of multiple gene regulation, some of which are 
causal and some of which are casually -related to the toxicological phenomenon per 
se. This observation has led to an upsurge in interest in gene-profiling technologies 
which differentiate between the control and toxin-treated gene pools in target tissues 
and is, therefore, of value in rationalizing the molecular mechanisms of xenobiotic- 
induced toxicity. Knowledge of toxin- dependent gene regulation in target tissues is 
not solely an academic pursuit as much interest has been generated in the 
pharmaceutical industry to harness this technology in the early identification of toxic 
drug candidates, thereby shortening the developmental process and contributing 
substantially to the safety assessment of new drugs. For example, if the gene profile 
in response to say a testicular toxin that has been well-characterized in vivo could be 
determined in the testis, then this profile would be representative of all new drug 
candidates which act via this specific molecular mechanism of toxicity, thereby 
providing a useful and coherent approach to the early detection of such toxicants. 
Whereas it would be informative to know the identity and functionality of all genes 
up/ down regulated by such toxicants, this would appear a longer term goal, as the 
majority of human genes have not yet been sequenced, far less their functionality 
determined. However, the current use of gene profiling yields a pattern of gene 
changes for a xenobiotic of unknown toxicity which may be matched to that of well- 
characterized toxins, thus alerting the toxicologist to possible in vivo similarities 
between the unknown and the standard, thereby providing a platform for more 
extensive toxicological examination. Such approaches are beginning to gain 
momentum, in that several biotechnology companies are commercially producing 
'gene chips' or 'gene arrays' that may be interrogated for toxicity assessment of 
xenobiotics. These chips consist of hundreds/ thousands of genes, some of which are 
degenerate in the sense that not all of the genes are mechanistically-related to any 
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum 
screening, they are maturing at a substantial rate, in that gene arrays are now 
becoming more specific, e.g. chips for the identification of changes in growth factor 
families that contribute to the aetiology and development of chemically-induced 
neoplasias. 

Although documenting and explaining these genetic changes presents a 
formidable obstacle to understanding the different mechanisms of development and 
disease progression, the technology is now available to begin attempting this difficult 
challenge. Indeed, several 'differential expression analysis' methods have been 
developed which facilitate the identification of gene products that demonstrate 
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altered expression in cells of one population compared to another. These methods 
have been used to identify differential gene expression in many situations, including 
invading pathogenic microbes (Zhao et aL 1998), in cells responding to extracellular 
and intracellular microbial invasion (Duguid and Dinauer 1990, Ragno et al. 1997, 
Maldarelli et aL 1998), in chemically treated cells (Syed et aL 1997, Rockett et al. 
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998), 
activated cells (Gurskaya et aL 1996, Wan et aL 1996), differentiated cells (Hara et 
al. 1991, Guimaraes et al. 1995a, b), and different cell types (Davis et al. 1984, 
Hedrick et al. 1984, Xhu et aL 1998). Although differential expression analysis 
technologies are applicable to a broad range of models, perhaps their most important 
advantage is that, in most cases, absolutely no prior knowledge of the specific genes 
which are up- or down-regulated is required. 

The field of differential expression analysis is a large and complex one, with 
many techniques available to the potential user. These can be categorized into 
several methodological approaches, including: 

(1) Differential screening, 

(2) Subtractive hybridization (SH) (includes methods such as chemical cross- 
linking subtraction — CCLS, suppression-PCR subtractive hybridization — 
SSH, and representational difference analysis — RDA), 

(3) Differential display (DD), 

(4) Restriction endonuclease facilitated analysis (including serial analysis of gene 

(5) Gene expression arrays, and 

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially 
expressed genes in different model systems. However, each method has its own 
subtle (and sometimes not so subtle) characteristics which incur various advantages 
and disadvantages. Accordingly, it is the purpose of this review to clarify the 
mechanistic principles underlying the main differential expression methods and to 
highlight some of the broader considerations and implications of this very powerful 
and increasingly popular technique. Specifically, we will concentrate on the so- 
called 'open' systems, namely those which do not require any knowledge of gene 
sequences and, therefore, are useful for isolating unknown genes. Two 'closed' 
systems (those utilising previously identified gene sequences), EST analysis and the 
use of DNA arrays, will also be considered briefly for completeness. Whilst 
emphasis will often be placed on suppression PCR subtractive hybridization (SSH, 
the approach employed in this laboratory), it is the aim of the authors to highlight, 
wherever possible, those areas of common interest to those who use, or intend to use, 
differential gene expression analysis. 



Differential cDNA library screening (DS) 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis, 
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. One of the original 
approaches used to identify such genes was described 20 years ago by St John and 
Davis (1979). These authors developed a method, termed 'differential plaque filter 
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hybridization *, which was used to isolate galactose-inducible DNA sequences from 
yeast. The theory is simple: a genomic DNA library is prepared from normal, 
unstimulated cells of the test organism/tissue and multiple filter replicas are 
prepared. These replica blots are probed with radioactively (or otherwise) labelled 
complex cDNA probes prepared from the control and test cell mRNA populations. 
Those mRNAs which are differentially expressed in the treated cell population will 
show a positive signal only on the filter probed with cDNA from the treated cells. 
Furthermore, labelled cDNA from different test conditions can be used to probe 
multiple blots, thereby enabling the identification of mRNAs which are only up- 
regulated under certain conditions. For example, St John and Davis (1979) screened 
replica filters with acetate-, glucose- and galactose-derived probes in order to obtain 
genes induced specifically by galactose metabolism. Although groundbreaking in its 
time this method is now considered insensitive and time-consuming, as up to 2 
months are required to complete the identification of genes which are differentially 
expressed in the test population. In addition, there is no convenient way to check 
that the procedure has worked until the whole process has been completed. 

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early 
approaches such as that described by St John and Davis (1979) soon gave rise to a 
search for more convenient methods of analysis. One of the first to be developed was 
SH, numerous variations of which have since been reported (see below). In general, 
this approach involves hybridization of mRNA /cDN A from one population (tester) 
to excess mRNA/cDNA from another (driver), followed by separation of the 
unhybridized tester fraction (differentially expressed) from the hybridized common 
sequences. This step has been achieved physically, chemically and through the use 
of selective polymerase chain reaction (PCR) techniques. 

Physical separation 

Original subtractive hybridization technology involved the physical separation 
of hybridized common species from unique single stranded species. Several methods 
of achieving this have been described, including hydroxyapatite chromatography 
(Sargent and Dawid 1983), avidin-biotin technology (Duguid and Dinauer 1990) 
and oligodT-latex separation (Hara et al. 1991). In the first approach, common 
mRNA species are removed by cDNA (from test cells)-mRNA (from control cells) 
subtractive hybridization followed by hydroxyapatite chromatography, as hydroxy- 
apatite specifically adsorbs the cDNA-mRNA hybrids. The unabsorbed cDNA is 
then used either for the construction of a cDNA library of differentially expressed 
genes (Sargent and Dawid 1983, Schneider et al. 1988) or directly as a probe to 
screen a preselected library (Zimmerman et al. 1 980, D avis et al. 1 984, Hedrick et al. 
1984). A schematic diagram of the procedure is shown in figure 1. 

Less rigorous physical separation procedures coupled with sensitivity enhancing 
PCR steps were later developed as a means to overcome some of the problems 
encountered with the hydroxyapatite procedure. For example, Daguid and Dinauer 
(1990) described a method of subtraction utilizing biotin-affinity systems as a means 
to remove hybridized common sequences. In this process, both the control and 
tester mRNA populations are first converted to cDN A and an adaptor ( ' oligovector ', 
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or 
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Figure 1. The hydroxyapatite method of sub tractive hybridization. cDNA derived from the 
treated /altered (tester) population is mixed with a large excess of mRNA from the control (driver) 
population. Following hybridization, mRNA-cDNA hybrids are removed by hydroxyapatite 
chromatography. The only cDNAs which remain are those which are differentially expressed in 
the treated /altered population. In order to facilitate the recovery of full length clones, small cDNA 
fragments are removed by exclusion chromatography. The remaining cDN As are then cloned into 
a vector for sequencing, or labelled and used directly to probe a library, as described by Sargent 
and Dawid (1983). 



containing a restriction site) ligated to both sides. Both populations are then 
amplified by PCR, but the driver cDNA population is subsequently digested with 
the adaptor-containing restriction endonuclease. This serves to cleave the oligo- 
vector and reduce the amplification potential of the control population. The digested 
control population is then biotinylated and an excess mixed with tester cDNA. 
Following denaturation and hybridization, the mix is applied to a biocytin column 
(streptavidin may also be used) to remove the control population, including 
heteroduplexes formed by annealing of common sequences from the tester 
population. The procedure is repeated several times following the addition of fresh 
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Figure 2. The use of oligodT^ latex to perform subtractive hybridization. mRNA extracted from the 
control (driver) population is converted to anchored cDNA using polydT oligonucleotides 
attached to latex beads. mRNA from the treated/altered (tester) population is repeatedly 
hybridized against an excess of the anchored driver cDNA. The final population of mRNA is 
tester specific and can be converted into cDNA for cloning and other downstream applications, as 
described by Hara et al. (1991). 
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control cDNA. In order to further enrich those species differentially expressed in 
the tester cDNA, the subtracted tester population is amplified by PCR following 
every second subtraction cycle. After six cycles of subtraction (three reamplification 
steps) the reaction mix is ligated into a vector for further analysis. 

In a slightly different approach, Hara et al. (1991) utilized a method whereby 
oligo(dT 30 ) primers attached to a latex substrate are used to first capture mRNA 
extracted from the control population. Following 1st strand cDNA synthesis, the 
RNA strand of the heteroduplexes is removed by heat denaturation and centri- 
fugation (the cDNA-oligotex-dT^ forms a pellet and the supernatant is removed). 
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control 
(driver) cDNA (which is present in 20-fold excess). After several rounds of 
hybridization the only mRNA molecules left in the tester mRNA population are 
those which are not found in the driver cDN A-oligotex-dT^ population. These 
tester-specific mRNA species are then converted to cDNA and, following the 
addition of adaptor sequences, amplified by PCR. The PCR products are then 
ligated into a vector for further analysis using restriction sites incorporated into the 
PCR primers. A schematic illustration of this subtraction process is shown in figure 
2. 

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 

Chemical Cross-Linking Subtraction ( CCLS) 

In this technique, originally described by Hampson et al. (1992), driver mRNA 
is mixed with tester cDNA (1st strand only) in a ratio of > 20:1. The common 
sequences form cDNA :mRNA hybrids, leaving the tester specific species as single 
stranded cDN A. Instead of physically separating these hybrids, they are inactivated 
chemically using 2,5 diaziridinyl-l,4-benzoquinone (DZQ). Labelled probes are 
then synthesized from the remaining single stranded cDNA species (unreacted 
mRNA species remaining from the driver are not converted into probe material due 
to specificity of Sequenase T7 DNA polymerase used to make the probe) and used 
to screen a cDN A library made from the tester cell population. A schematic diagram 
of the system is shown in figure 3. 

It has been shown that the differentially expressed sequences can be enriched at 
least 300-fold with one round of subtraction (Hampson et al. 1992), and that the 
technique should allow isolation of cDNAs derived from transcripts that are present 
at less than 50 copies per cell. This equates to genes at the low end of intermediate 
abundance (see table 1). The main advantages of the CCLS approach are that it is 
rapid, technically simple and also produces fewer false positives than other 
differential expression analysis methods. However, like the physical separation 
protocols, a major drawback with CCLS is the large amount of starting material 
required (at least 10 fig RNA). Consequently, the technique has recently been 
refined so that a renewable source of RNA can be generated. The degenerate random 
oligonucleotide primed (DROP) adaptation (Hampson et al. 1996, Hampson and 
Hampson 1997) uses random hexanucleotide sequences to prime solid phase- 
synthesized cDNA. Since each primer includes a T7 polymerase promotor sequence 
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Figure 3. Chemical cross-linking subtraction. Excess driver mRNA is mixed with 1 st strand tester 
cDNA. The common sequences form mRNA:cDNA hybrids which are cross linked with 2,5 
diaziridinyl-l,4-benzoquinone (DZQ) and the remaining cDNA sequences are differentially 
expressed in the tester population. Probes are made from these sequences using Sequenase 2.0 
DNA polymerase, which lacks reverse transcriptase activity and, therefore, does not react with the 
remaining mRNA molecules from the driver. The labelled probes are then used to screen a cDNA 
library for clones of differentially expressed sequences. Adapted from Walter et al. (1996), with 
permission. 



Table 1. The abundance of mRNA species and classes in a typical mammalian cell. 



mRNA 
class 


Copies of 

each 
species/cell 


No. of mRNA 
species in 
class 


Mean % of 
each species 
in class 


Mean mass 
(ng) of each 
species/ //g 
total RNA 


Abundant 


12000 


4 


3.3 


1.65 


Intermediate 


300 


500 


0.08 


0.04 


Rare 


15 


11000 


0.004 


0.002 



Modified from Bertioli et al. (1995). 
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at the 5' end, the final pool of random cDNA fragments is a PCR-renewable cDNA 
population which is representative of the expressed gene pool and can be used to 
synthesize sense RNA for use as driver material. Furthermore, if the final pool of 
random cDNA fragments is reamplified using biotinylated T7 primer and random 
hexamer, the product can be captured with streptavidin beads and the antisense 
strand eluted for use as tester. Since both target and driver can be generated from 
the same DROP product, subtraction can be performed in both directions (i.e. for 
up- and down-regulated species) between two different DROP products. 

Representational Difference Analysis (RDA) 

RDA of cDNA (Hubank and Schatz 1994) is an extension of the technique 
originally applied to genomic DNA as a means of identifying differences between 
two complex genomes (Lisitsyn et al. 1993). It is a process of subtraction and 
amplification involving subtractive hybridization of the tester in the presence of 
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those genes expressed only in the tester retain the 
ability to be amplified by PCR. The procedure is shown schematically in figure 4. 

In essence, the driver and tester mRN A populations are first converted to cDN A 
and amplified by PCR following the ligation of an adaptor. The adaptors are then 
removed from both populations and a new (different) adaptor ligated to the 
amplified tester population only. Driver and tester populations are next melted and 
hybridized together in a ratio of 100:1. Following hybridization, only tester : tester 
homohybrids have 5 'adaptors at each end of the DNA duplex and can, thus, be filled 
in at both V ends. Hence, only these molecules are amplified exponentially during 
the subsequent PCR step. Although tester : driver heterohybrids are present, they 
only amplify in a linear fashion, since the strand derived from the driver has no 
adaptor to which the primer can bind. Driver: driver heterohybrids have no 
adaptors and, therefore, are not amplified. Single stranded molecules are digested 
with mung bean nuclease before a further PCR-enrichment of the tester : tester 
homohybrids. The adaptors on the amplified tester population are then replaced and 
the whole process repeated a further two or three times using an increasing excess of 
driver (Hubank and Shatz used a tester : driver ratio of 1:400, 1:80000 and 
1 : 800000 for the second, third and fourth hybridizations, respectively). Different 
adaptors are ligated to the tester between successive rounds of hybridization and 
amplification to prevent the accumulation of PCR products that might interfere with 
subsequent amplifications. The final display is a series of differentially expressed 
gene products easily observable on an ethidium bromide gel. 

The main advantages of RDA are that it offers a reproducible and sensitive 
approach to the analysis of differentially expressed genes. Hubank and Schatz (1994) 
reported that they were able to isolate genes that were differentially expressed in 
substantially less than 1% of the cells from which the tester is derived. Perhaps the 
main drawback is that multiple rounds of ligation, hybridization, amplifiation and 
digestion are required. The procedure is, therefore, lengthier than many other 
differential display approaches and provides more opportunity for operator-induced 
error to occur. Although the generation of false positives has been noted, this has 
been solved to some degree by O'Neill and Sinclair (1997) through the use of HPLC- 
purified adaptors. These are free of the truncated adaptors which appear to be a 
major source of the false positive bands. A very similar technique to RDA, termed 
linker capture subtraction (LCS) was described by Yang and Sytowski (1996). 
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ds control (driver) cDNA ds test (tester) cDNA 



Digest with restriction enzyme 



Ligate to 
dephosphorylated 
12/24 adaptor 
strands 



Melt12mer 



Fill in 3' ends (Taq), add 

primer ( ) and 

amplify 



Digest and ligate 
new 12/24 adaptor 



Mix 100:1, melt and hybridize 



11 1 
Fill in ends, add primer ( — ) and amplify 

11 1 
Linear amplification Exponential amplification No amplification 

1 

Digest PCR products with mung bean nuclease to remove 
ssDNA molecules present after amplification 

1 

First difference 

Figure 4. The representational difference analysis (RDA) technique. Driver and tester cDNA are 
digested with a 4-cutter restriction enzyme such as Dpnll. The 1 st set of 12/24 adaptor strands 
(oligonucleotides) are ligated to each other and the digested cDNA products. The 12mer is 
subsequently melted away and the Vends filled in using Taq DNA polymerase. Each cDNA 
population is then amplified using PCR, following which the 1 st set of adaptors is removed with 
Dpnll. A second set of 12/24 adaptor strands is then added to the amplified tester cDNA 
population, after which the tester is hybridized against a large excess of driver. The 12mer 
adaptors are melted and the 3 ends filled in as before. PCR is carried out with primers identical 
to the new 24mer adaptor. Thus, the only hybridization products which are exponentially 
amplified are those which are tester: tester combinations. Following PCR, ssDNA products are 
removed with mung bean nuclease, leaving the 'first difference product*. This is digested and a 
third set of 12/24 adaptors added before repeating the subtraction process from the hybridization 
stage. The process is repeated to the 3 rd or 4 th difference product, as described by Lisitsyn et al. 
(1993) and Hubank and Schatz (1994). 
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Suppression PCR Subtr active Hybridization (SSH) 

The most recent adaptation of the SH approach to differential expression 
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996). 
They reported that a 1000-5000 fold enrichment of rare cDNAs (equivalent to 
isolating mRNAs present at only a few copies per cell) can be obtained without the 
need for multiple hybridizations/subtractions. Instead of physical or chemical 
removal of the common sequences, a PCR-based suppression system is used (see 
figure 5). 

In SSH, excess driver cDNA is added to two portions of the tester cDNA which 
have been ligated with different adaptors. A first round of hybridization serves to 
enrich differentially expressed genes and equalize rare and abundant messages. 
Equalization occurs since reannealing is more rapid for abundant molecules than for 
rarer molecules due to the second order kinetics of hybridization (James and Higgins 
1985). The two primary hybridization mixes are then mixed together in the presence 
of excess driver and allowed to hybridize further. This step permits the annealing of 
single stranded complementary sequences which did not hybridize in the primary 
hybridization, and in doing so generates templates for PCR amplification. Although 
there are several possible combinations of the single stranded molecules present in 
the secondary hybridization mix, only one particular combination (differentially 
expressed in the tester cDN A composed of complimentary strands having different 
adaptors) can amplify exponentially. 

Having obtained the final differential display, two options are available if cloning 
of cDN As is desired. One is to transform the whole of the final PCR reaction into 
competent cells. Transformed colonies can then be isolated and their inserts 
characterized by sequencing, restriction analysis or PCR. Alternatively, the final 
PCR products can be resolved on a gel and the individual bands excised, reamplified 
and cloned. The first approach is technically simpler and less time consuming. 
However, ligation/tran'sformation reactions are known to be biased towards the 
cloning of smaller molecules, and so the final population of clones will probably not 
contain a representative selection of the larger products. In addition, although 
equalization theoretically occurs, observations in this laboratory suggest that this is 
by no means perfectly accomplished. Consequently, some gene species are present 
in a higher number than others and this will be represented in the final population 
of clones. Thus, in order to obtain a substantial proportion of those gene species that 
actually demonstrate differential expression in the tester population, the number of 
clones that will have to be screened after this step may be substantial. The second 
approach is initially more time consuming and technically demanding. However, it 
would appear to offer better prospects for cloning larger and low abundance gel 
products. In addition, one can incorporate a screening step that differentiates 
different products of different sequences but of the same size (HA-staining, see 
later). In this way, a good idea of the final number of clones to be isolated and 
identified can be achieved. 

An alternative (or even complementary) approach is to use the final differential 
display reaction to screen a cDNA library to isolate full length clones for further 
characterization, or a DNA array (see later) to quickly identify known genes. SSH 
has been used in this laboratory to begin characterization of the short-term gene 
expression profiles of enzyme-inducers such as phenobarbital (Rockett et al. 1997) 
and Wy-14,643 (Rockett et al. unpublished observations). The isolation of 
differentially expressed genes in this manner enables the construction of a fingerprint 
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Tester cDNA with adaptor 1 



Driver cDNA 
(in excess) 



Tester cDNA with adaptor 2 




Mix samples, add fresh denatured driver, anneai 



a,b,c,d & e 



Fill in ends 





Add primers and 
amplify by PCR 




a, d no amplification 

b no amplification - suppressed due to 

formation of panhandle structure 
c linear amplification 

e exponential amplification 

Figure 5. PCR-select cDNA subtraction. In the primary hybridization, an excess of driver cDNA is 
added to each tester cDNA population. The samples are heat denatured and allowed to hybridize 
for between 3 and 8 h. This serves two purposes : (1) to equalize rare and abundant molecules ; and 
(2) to enrich for differentially expressed sequences— cDNAs that are not differentially expressed 
form type c molecules with the driver. In the secondary hybridization, the two primary 
hybridizations are mixed together without denaturing. Fresh denatured driver can also be added 
at this point to allow further enrichment of differentially expressed sequences. Type e molecules 
are formed in this secondary hybridization which are subsequently amplified using two rounds of 
PCR. The final products can be visualized on an agarose gel, labelled directly or cloned into a 
vector for downstream manipulation. As described by Diatchenko et al. (1996) and Gurskaya 
et al. (1996), with permission. 
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Figure 6. Flow diagram showing method used in this laboratory to isolate and identify clones of genes 
which are differentially expressed in rat liver following short term exposure to the enzyme 
inducers, phenobarbital and Wy-14,643. 



of expressed genes which are unique to each compound and time/dose point. Such 
information could be useful in short-term characterization of the toxic potential of 
new compounds by comparing the gene-expression profiles they elicit with those 
produced by known inducers. Figure 6 shows a flow diagram of the method used to 
isolate, verify and clone differentially expressed genes, and figure 7 shows expression 
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the 
individual bands, sequencing and gene data base interrogation reveals many genes 
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3). 

One of the advantages in using the SSH approach is that no prior knowledge is 
required of which specific genes are up/down-regulated subsequent to xenobiotic 
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Figure 7. SSH display patterns obtained from rat liver following 3-day treatment with WY-14,643 or 
phenobarbital. mRNA extracted from control and treated livers was used to generate the 
differential displays using the PCR-Select cDNA subtraction kit (Clontech). Lane: 1 — lkb 
ladder ; 2 — genes upregulated following Wy, 14-643 treatment; 3 — genes downregulated following 
Wy, 14-643 treatment; 4 — genes upregulated following phenobarbital treatment; 5 — genes 
downregulated following phenobarbital treatment; 6 — lkb ladder. Reproduced from Rockett et 
aL (1997), with permission. 

exposure, and an almost complete complement of genes are obtained. For example, 
the peroxisome proliferator and non-genotoxic hepatocarcinogen Wy,14,643, up- 
regulates at least 28 genes and down-regulates at least 15 in the rat (a sensitive 
species) and produces 48 up- and 37 down-regulated genes in the guinea pig, a 
resistant species (Rockett, Swales, Esda and Gibson, unpublished observations). 
One of these genes, CD81, was up-regulated in the rat and down-regulated in the 
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is 
a widely expressed cell surface protein which is involved in a large number of cellular 
processes including adhesion, activation, proliferation and differentiation (Levy et 
aL 1998). Since all of these functions are altered to some extent in the phenomena 
of hepatomegaly and non-genotoxic hepatocarcinogenesis, it is intriguing, and 
probably mechanistically-relevant, that CD81 expression is differentially regulated 
in a resistant and susceptible species. However, the down-side of this approach is 
that the majority of genes can be sequenced and matched to database sequences, but 
the latter are predominantly expressed sequence tags or genes of completely 
unknown function, thus partially obscuring a realistic overall assessment of the 
critical genes of genuine biological interest. Notwithstanding the lack of complete 
funtional identification of altered gene expression, such gene profiling studies 
essentially provides a * molecular fingerprint ' in response to xenobiotic challenge, 
thereby serving as a mechanistically-relevant platform for further detailed 
investigations. 

Differential Display (DD) 

Originally described as *RNA fingerprinting by arbitrarily primed PCR ' (Liang 
and Pardee 1992) this method is now more commonly referred to as 'differential 
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Band number 



(approximate 


Highest sequence 




size in bp) 


similarity 


FASTA-EMBL gene identification 


5 (1300) 


93.5% 


CYP2B1 


7 (1000) 


95.1% 


Preproalbumin 






Serum albumin mRNA 


8 (950) 


98.3% 


NCI-CGAP-Prl H. sapiens (EST) 


10 (850) 


95.7% 


CYP2B1 


11 (800) 


Clone 1 94.9% 


CYP2B1 


Clone 2 75.3% 


CYP2B2 


12 (750) 


93.8% 


TRPM-2 mRNA 




Sulfated glycoprotein 


15 (600) 


92.9% 


Preproalbumin 






Serum albumin mRNA 


16(55) 


Clone 1 95.2% 


CYP2B1 


Clone 2 93.6% 


Haptoglobulin mRNA partial alpha 


21 (350) 


99.3% 


18S, 5.8S&28SrRNa 



Bands 1-4, 6, 9, 13, 14, and 17-20 are shown to be false positives by dot blot anaylsis and, therefore, 
are not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes do not 
represent the complete spectrum of genes which are up-regulated in rat liver by phenobarbital, but 
simply represents the genes sequenced and identified to date. 



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital. 



Band number 

(approximate Highest sequence 

size in bp) similarity FASTA-EMBL gene identification 



1 (1500) 




95.3% 


3-oxoacyl-CoA thiolase 


2 (1200) 




92.3% 


Hemopoxin mRNA 


3 (1000) 




91.7% 


Alpha-2u-globulin mRNA 


7 (700) 


Clone 1 


77.2% 


M.musculus CI inhibitor 


Clone 2 


94.5% 


Electron transfer flavoprotein 




Clone 3 


91.0% 


M. musculus Topoisomerase 1 (Topo 1) 


8 (650) 


Clone 1 


86.9% 


Soares 2NbMT M. musculus (EST) 


Clone 2 


96.2% 


Alpha-2u-globulin (s-type) mRNA 


9 (600) 


Clone 1 


86.9% 


Soares mouse NML M. musculus (EST) 




Clone 2 


82.0% 


Soares p3NMF 19.5 M. musculus (EST) 


10 (550) 




73.8% 


Soares mouse NML M. musculus (EST) 


11 (525) 




95.7% 


NCI-CGAP-Prl H. sapiens (EST) 


12 (375) 




100.0% 


Ribosomal protein 


13 (23) 


Clone 1 


97.2% 


Soares mouse embryo NbME135 (EST) 




Clone 2 


100.0% 


Fibrinogen B-beta-chain 




Clone 3 


100.0% 


A po lipoprotein E gene 


14 (170) 




96.0% 


Soares p3NMF19.5 M. musculus (EST) 


15 (140) 




97.3% 


Stratagene mouse testis (EST) 


Others: (300) 




96.7% 


R. norvegicus RASP 1 mRNA 


(275) 




93.1% 


Soares mouse mammary gland (EST) 



EST = Expressed sequence tag. Bands 4-6 were shown to be false positives by dot blot analysis and, 
therefore, were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes 
do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital, 
but simiply represents the genes sequenced and identified to date. 



display' (DD). In this method, all the mRNA species in the control and treated cell 
populations are amplified in separate reactions using reverse transcriptase-PCR 
(RT-PCR). The products are then run side-by-side on sequencing gels. Those 
bands which are present in one display only, or which are much more intense in one 
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display compared to the other, are differentially expressed and may be recovered for 
further characterization. One advantage of this system is the speed with which it can 
be carried out — 2 days to obtain a display and as little as a week to make and identify 
clones. 

Two commonly used variations are based on different methods of priming the 
reverse transcription step (figure 8). One is to use an oligo dT with a 2-base 'anchor' 
at the 3'-end, e.g. 5' (dT n )CA 3' (Liang and Pardee 1992). Alternatively, an 
arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992). 
This variant of RNA fingerprinting has also been called 'RAP' (RNA Arbitrarily 
Primed)-PCR. One advantage of this second approach is that PCR products may be 
derived from anywhere in the RNA, including open reading frames. In addition, it 
can be used for mRNAs that are not polyadenylated, such as many bacterial mRN As 
(Wong and McClelland 1994). In both cases, following reverse transcription and 
denaturation, second strand cDNA synthesis is carried out with an arbitrary primer 
{arbitrary primers have a single base at each position, as compared to random 
primers, which contain a mixture of all four bases at each position). The resulting 
PCR, thus, produces a series of products which, depending on the system (primer 
length and composition, polymerase and gel system), usually includes 50-100 
products per primer set (Band and Sager 1989). When a combination of different 
dT-anchors and arbitrary primers are used, almost all mRN A species from a cell can 
be amplified. When the cDN A products from two different populations are analysed 
side by side on a polyacrylamide gel, differences in expression can be identified and 
the appropriate bands recovered for cloning and further analysis. 

Although DD is perhaps the most popular approach used today for identifying 
differentially expressed genes, it does suffer from several perceived disadvantages: 

(1) It may have a strong bias towards high copy number mRNAs (Bertioli et al. 
1995), although this has been disputed (Wan etal. 1996) and the isolation of very 
low abundance genes may be achieved in certain circumstances (Guimeraes et 
al. 1995a). 

(2) The cDNAs obtained often only represent the extreme 3' end of the mRN A 
(often the 3 '-untranslated region), although this may not always be the case 
(Guimeraes e£ al. 1995a). Since the 3 'end is often not included in Genbank and 
shows variation between organisms, cDNAs identified by DD cannot always be 
matched with their genes, even if they have been identified. 

(3) The pattern of differential expression seen on the display often cannot be 
reproduced on Northern blots, with false positives arising in up to 70% of cases 
(Sun et al. 1994). Some adaptations have been shown to reduce false positives, 
including the use of two reverse transcriptases (Sung and Denman 1997), 
comparison of uninduced and induced cells over a time course (Burn et al. 1 994) 
and comparison of DDPCR-products from two uninduced and two induced 
lines (Sompayrac et al. 1995). The latter authors also reported that the use of 
cytoplasmic RNA rather then total RNA reduces false positives arising from 
nuclear RNA that is not transported to the cytoplasm. 

Further details of the background, strengths and weaknesses of the DD 
technique can be obtained from a review by McClelland et al. (1996) and from 
articles by Liang et al. (1995) and Wan et al. (1996). 
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Figure 8. Two approaches to differential display (DD) analysis. 1 st strand synthesis can be carried out 
either with a polydT,, NN primer (where N = G, C or A) or with an arbitrary primer. The use of 
different combinations of G, C and A to anchor the first strand polydT primer enables the priming 
of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more 
places along the length of the mRNA, allowing 1 st strand cDNA synthesis to occur at none, one 
or more points in the same gene. In both cases, 2 nd strand synthesis is carried out with an arbitrary 
primer. Since these arbitrary primers for the 2 nd strand may also hybridize to the 1 st strand cDNA 
in a number of different places, several different 2 nd strand products may be obtained from one 
binding point of the 1 st strand primer. Following 2 nd strand synthesis, the original set of primers 
is used to amplify the second strand products, with the result that numerous gene sequences are 
amplified. 



Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE) 

A more recent development in the field of differential display is SAGE analysis 
(Velculescu et al. 1995). This method uses a different approach to those discussed so 
far and is based on two principles. Firstly, in more than 95% of cases, short 
nucleotide sequences (' tags') of only nine or 10 base pairs provide sufficient 
information to identify their gene of origin. Secondly, concatenation (linking 
together in a series) of these tags allows sequencing of multiple cDNAs within a 
single clone. Figure 9 shows a schematic representation of the SAGE process. In this 
procedure, double stranded cDNA from the test cells is synthesized with a 
biotinylated polydT primer. Following digestion with a commonly cutting (4bp 
recognition sequence) restriction enzyme ('anchoring enzyme'), the 3' ends of the 
cDNA population are captured with streptavidin beads. The captured population is 
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split into two and different adaptors ligated to the 5 'ends of each group. Incorporated 
into the adaptors is a recognition sequence for a type IIS restriction enzyme — one 
which cuts DNA at a denned distance (< 20 bp) from its recognition sequence. 
Hence, following digestion of each captured cDNA population with the IIS enzyme, 
the adaptors plus a short piece of the captured cDNA are released. The two 
populations are then ligated and the products amplified. The amplified products are 
cleaved with the original anchoring enzyme, religated (concatomers are formed in 
the process) and cloned. The advantage of this system is that hundreds of gene tags 
can be identified by sequencing only a few clones. Furthermore, the number of times 
a given transcript is identified is a quantitative measurement of that gene's 
abundance in the original population, a feature which facilitates identification of 
differentially expressed genes in different cell populations. 

Some disadvantages of SAGE analysis include the technical difficulty of the 
method, a large amount of accurate sequencing is required, biased towards abundant 
mRNAs, has not been validated in the pharmaco/toxicogenomic setting and has 
only been used to examine well known tissue differences to date. 

Gene Expression Fingerprinting ( GEF ) 

A different capture/restriction digest approach for isolating differentially 
expressed genes has been described by Ivanova and Belyavsky (1995). In this 
method, RNA is converted to cDNA using biotinylated oligo(dT) primers. The 
cDNA population is then digested with a specific endonuclease and captured with 
magnetic streptavidin microbeads to facilitate removal of the unwanted 5 'digestion 
products. The use of restricted 3 '-ends alone serves to reduce the complexity of the 
cDNA fragment pool and helps to ensure that each RNA species is represented by 
not more than one restriction product. An adaptor is ligated to facilitate subsequent 
amplification of the captured population. PCR is carried out with one adaptor- 
specific and one biotinylated polydT primer. The reamplified population is 
recaptured and the non-biotinylated strands removed by alkaline dissociation. The 
non-biotinylated strand is then resynthesized using a different adaptor-specific 
primer in the presence of a radiolabeled dNTP. The labelled immobilized 3' cDN A 
ends are next sequentially treated with a series of different restriction endonucleases 
and the products from each digestion analysed by PAGE. The result is a fingerprint 
composed of a number of ladders (equal to the number of sequential digests used). 
By comparing test versus control fingerprints, it is possible to identify differentially 
expressed products which can then be isolated from the gel and cloned. The 
advantages of this procedure are that it is very robust and reproducible, and the 
authors estimate that 80-93% of cDNA molecules are involved in the final 
fingerprint. The disadvantage is that polyacrylamide gels can rarely resolve more 
than 300-400 bands, which compares poorly to the 1000 or more which are 
estimated to be produced in an average experiment. The use of 2-D gels such as 
those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to 
overcome this problem. 

A similar method for displaying restriction endonuclease fragments was later 
described by Prashar and Weissman (1996). However, instead of sequential 
digestion of the immobolized 3 , -terminal cDNA fragments, these authors simply 
compared the profiles of the control and treated populations without further 
manipulation. 
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Figure 9. Serial analysis of gene expression (SAGE) analysis. cDNA is cleaved with an anchoring enzyme 
(AE)and the 3'ends captured using streptavidin beads. ThecDNA pool is divided in half and each 
portion ligated to a different linker, each containing a type IIS restriction site (tagging enzyme, 
TE). Restriction with the type IIS enzyme releases the linker plus a short length of cDNA 
(XXXXX and OOOOO indicate nucleotides of different tags). The two pools of tags are then 
ligated and amplified using linker-specific primers. Following PCR, the products are cleaved with 
the AE and the ditags isolated from the linkers using PAGE. The ditags are then ligated (during 
which process, eon eaten ization occurs) and cloned into a vector of choice for sequencing. After 
Velculescu et al. (1995), with permission. 
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DNA arrays 

'Open' differential display systems are cumbersome in that it takes a great deal 
of time to extract and identify candidate genes and then confirm that they are indeed 
up- or down-regulated in the treated compared to the control tissue. Normally, the 
latter process is carried out using Northern blotting or RT-PCR. Even so, each of 
the aforementioned steps produce a bottleneck to the ultimate goal of rapid analysis 
of gene expression. These problems will likely be addressed by the development of 
so-called DNA arrays (e.g. Gress et al. 1992, Zhao et al. 1995, Schena et al 1996), 
the introduction of which has signalled the next era in differential gene expression 
analysis. DNA arrays consist of a gridded membrane or glass ' chips* containing 
hundreds or thousands of DNA spots, each consisting of multiple copies of part of 
a known gene. The genes are often selected based on previously proven involvement 
in oncogenesis, cell cycling, DNA repair, development and other cellular processes. 
They are usually chosen to be as specific as possible for each gene and animal species. 
Human and mouse arrays are already commercially available and a few companies 
will construct a personalized array to order, for example Clontech Laboratories and 
Research Genetics Inc. The technique is rapid in that hundreds or even thousands 
of genes can be spotted on a single array, and that mRNA /cDNA from the test 
populations can be labelled and used directly as probe. When analysed with 
appropriate hardware and software, arrays offer a rapid and quantitative means to 
assess differences in gene expression between two cell populations. Of course, there 
can only be identification and quantitation of those genes which are in the array 
(hence the term 'closed' system). Therefore, one approach to elucidating the 
molecular mechanisms involved in a particular disease /development system may be 
to combine an open and closed system — a DNA array to directly identify and 
quantitate the expression of known genes in mRNA populations, and an open 
system such as SSH to isolate unknown genes which are differentially expressed. 

One of the main advantages of DNA arrays is the huge number of gene fragments 
which can be put on a membrane — some companies have reported gridding up to 
60000 spots on a single glass 'chip' (microscope slide). These high density chip- 
based micro-arrays will probably become available as mass-produced off-the-shelf 
items in the near future. This should facilitate the more rapid determination of 
differential expression in time and dose-response experiments. Aside from their 
high cost and the technical complexities involved in producing and probing DNA 
arrays, the main problem which remains, especially with the newer micro-array 
(gene-chip) technologies, is that results are often not wholly reproducible between 
arrays. However, this problem is being addressed and should be resolved within the 
next few years. 



EST databases as a means to identify differentially expressed genes 

Expressed sequence tags (ESTs) are partial sequences of clones obtained from 
cDNA libraries. Even though most ESTs have no formal identity (putative 
identification is the best to be hoped for), they have proven to be a rapid and efficient 
means of discovering new genes and can be used to generate profiles of gene- 
expression in specific cells. Since they were first described by Adams et al. (1991), 
there has been a huge explosion in EST production and it is estimated that there are 
now well over a million such sequences in the public domain, representing over half 



Differential gene expression 



675 



of all human genes (Hillier et aL 1996). This large number of freely available 
sequences (both sequence information and clones are normally available royalty-free 
from the originators) has enabled the development of a new approach towards 
differential gene expression analysis as described by Vasmatzis et al. (1998). The 
approach is simple in theory: EST databases are first searched for genes that have a 
number of related EST sequences from the target tissue of choice, but none or few 
from non-target tissue libraries. Programmes to assist in the assembly of such sets of 
overlapping data may be developed in-house or obtained privately or from the 
internet. For example, the Institute for Genomic Research (TIGR, found at 
http:/ ywww.tigr.org) provides many software tools free of charge to the scientific 
community. Included amongst these is the TIGR assembler (Sutton et aL 1995), a 
tool for the assembly 'of large sets of overlapping data such as ESTs, bacterial 
artificial chromosomes (BAC)s, or small genomes. Candidate EST clones repre- 
senting different genes are then analysed using RN A blot methods for size and tissue 
specificity and, if required, used as probes to isolate and identify the full length 
cDNA clone for further characterization. In practice however, the method is rather 
more involved, requiring bioinformatic and computer analysis coupled with 
confirmatory molecular studies. Vasmatzis et al. (1998) have described several 
problems in this fledgling approach, such as separating highly homologous 
sequences derived from different genes and an overemphasis of specificity for some 
EST sequences. However, since these problems will largely be addressed by the 
development of more suitable computer algorithms and an increased completeness 
of the EST database, it is likely that this approach to identifying differentially 
expressed genes may enjoy more patronage in the future. 



Problems and potential of differential expression techniques 

The holistic or single cell approach ? 

When working with in vivo models of differential expression, one of the first 
issues to consider must be the presence of multiple cell types in any given specimen. 
For example, a liver sample is likely to contain not only hepatocytes, but also 
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g. 
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will 
each have their own distinctive cell populations. Also, in the case of neoplastic tissue, 
there are almost always normal, hyperplastic and/or dysplastic cells present in a 
sample. One must, therefore, be aware that genes obtained from a differential 
display experiment performed on an animal tissue model may not necessarily arise 
exclusively from the intended target' cells, e.g. hepatocytes/neoplastic cells. If 
appropriate, further analyses using immunohistochemistry, in situ hybridization or 
in situ RT-PCR should be used to confirm which cell types are expressing the 
gene(s) of interest. This problem is probably most acute for those studying the 
differential expression of genes in the development of different cell types, where 
there is a need to examine homologous cell populations. The problem is now being 
addressed at the National Cancer Institute (Bethesda, MD, USA) where new micro- 
disection techniques have been employed to assist in their gene analysis programme, 
the Cancer Genome Anatomy Project (CG AP) (For more information see web site: 
http : / ywww.ncbi.nlm.nih.gov/ncicgap /intro.html). There are also separation tech- 
niques available that utilise cell-specific antigens as a means to isolate target cells, 
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e.g. fluorescence activated cell sorting (FACS) (Dunbar et al. 1998, Kas-Deelen et 
al. 1998) and magnetic bead technology (Richard et al 1998, Rogler et al. 1998). 

However, those taking a holistic approach may consider this issue unimportant. 
There is an equally appropriate view that all those genes showing altered expression 
within a compromized tissue should be taken into consideration. After all, since all 
tissues are complex mixes of different, interacting cell types which intimately 
regulate each other's growth and development, it is clear that each cell type could in 
some way contribute (positively or negatively) towards the molecular mechanisms 
which lie behind responses to external stimuli or neoplastic growth. It is perhaps 
then more informative to carry out differential display experiments using in vivo as 
opposed to in vitro models, where uniform populations of identical cells probably 
represent a partial, skewed or even inaccurate picture of the molecular changes that 
occur. 

The incidence and possible implications of inter-individual biological variation 
should be considered in any approach where whole animal models are being used. It 
is clear that individuals (humans and animals) respond in different ways to identical 
stimuli. One of the best characterized examples is the debrisoquine oxidation 
polymorphism, which is mediated by cytochrome CYP2D6 and determines the 
pharmacokinetics of many commonly prescribed drugs (Lennard 1993, Meyer and 
Zanger 1997). The reasons for such differences are varied and complex, but allelic 
variations, regulatory region polymorphisms and even physical and mental health 
can all contribute to observed differences in individual responses. Careful thought 
should, therefore, be given to the specific objectives of the study and to the possible 
value of pooling starting material (tissue/mRNA). The effect of this can be 
beneficial through the ironing out of exaggerated responses and unimportant minor 
fluctuations of (mechanistically) irrelevant genes in individual animals, thus 
providing a clearer overall picture of the general molecular mechanisms of the 
response. However, at the same time such minor variations may be of utmost 
importance in deciding the ability of individual animals to succumb to or resist the 
effects of a given chemical/disease. 



How efficient are differential expression techniques at recovering a high percentage of 
differentially expressed genes? 

A number of groups have produced experimental data suggesting that mam- 
malian cells produce between 8000-15000 different mRNA species at any one time 
(Mechler and Rabbitts 1981, Hedrick et al. 1984, Bravo 1990), although figures as 
high as 20-30000 have also been quoted (Axel et al. 1976). Hedrick et al. (1984) 
provided evidence suggesting that the majority of these belong to the rare abundance 
class. A breakdown of this abundance distribution is shown in table 1. 

When the results of differential display experiments have been compared with 
data obtained previously using other methods, it is apparent that not all differentially 
expressed mRNAs are represented in the final display. In particular, rare messages 
(which, importantly, often include regulatory proteins) are not easily recovered 
using differential display systems. This is a major shortcoming, as the majority of 
mRNA species exist at levels of less than 0.005% of the total population (table 1). 
Bertioli et al (1995) examined the efficiency of DD templates (heterogeneous 
mRNA populations) for recovering rare messages and were unable to detect mRNA 
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species present at less than 1.2% of the total mRNA population — equivalent to an 
intermediate or abundant species. Interestingly, when simple model systems (single 
target only) were used instead of a heterogeneous mRNA population, the same 
primers could detect levels of target mRNA down to 10000X smaller. These results 
are probably best explained by competition for substrates from the many PCR 
products produced in a DD reaction. 

The numbers of differentially expressed mRNAs reported in the literature using 
various model systems provides further evidence that many differentially expressed 
mRNAs are not recovered. For example, DeRisi et al. (1997) used DNA array 
technology to examine gene expression in yeast following exhaustion of sugar in the 
medium, and found that more than 1700 genes showed a change in expression of at 
least 2-fold. In light of such a finding, it would not be unreasonable to suggest that 
of the 8000-15 000 different mRNA species produced by any given mammalian cell, 
up to 1000 or more may show altered expression following chemical stimulation. 
Whilst this may be an extreme figure, it is known that at least 100 genes are 
activated/upregulated in Jurkat (T-) cells following IL-2 stimulation (Ullman et al. 
1990). In addition, Wan et al. (1996) estimated that interferon- stimulated HeLa 
cells differentially express up to 433 genes (assuming 24000 distinct mRNAs 
expressed by the cells). However, there have been few publications documenting 
anywhere near the recovery of these numbers. For example, in using DD to compare 
normal and regenerating mouse liver, Bauer et al. (1993) found only 70 of 38000 
total bands to be different. Of these, 50% (35 genes) were shown to correspond to 
differentially expressed bands. Chen et al. (1996) reported 10 genes upregulated in 
female rat liver following ethinyl estradiol treatment. McKenzie and Drake (1997) 
identified 14 different gene products whose expression was altered by phorbol 
myristate acetate (PMA, a tumour promoter agent) stimulation of a human 
myelomonocytic cell line. Kilty and Vickers (1997) identified 10 different gene 
products whose expression was upregulated in the peripheral blood leukocytes of 
allergic disease sufferers. Linskens et al. (1995) found 23 genes differentially 
expressed between young and senescent fibroblasts. Techniques other than DD 
have also provided an apparent paucity of differentially expressed genes. Using SH 
for example, Cao et al. (1997) found 15 genes differentially expressed in colorectal 
cancer compared to normal mucosal epithelium. Fitzpatrick et al. (1995) isolated 17 
genes upregulated in rat liver following treatment with the peroxisome proliferator, 
clofibrate; Philips et al. (1990) isolated 12 cDNA clones which were upregulated in 
highly metastatic mammary adenocarcinoma cell lines compared to poorly meta- 
static ones. Prashar and Weissman (1996) used 3' restriction fragment analysis and 
identified approximately 40 genes showing altered expression within 4 h of 
activation of Jurkat T-cells. Groenink and Leegwater (1996) analysed 27 gene 
fragments isolated using SSH of delayed early response phase of liver regeneration 
and found only 12 to be upregulated. 

In the laboratory, SSH was used to isolate up to 70 candidate genes which appear 
to show altered expression in guinea pig liver following short-term treatment with 
the peroxisome proliferator, WY-14,643 (Rockett, Swales, Esdaile and Gibson, 
unpublished observations). However, these findings have still to be confirmed by 
analysis of the extracted tissue mRNA for differential expression of these sequences. 

Whilst the latest differential display technologies are purported to include design 
and experimental modifications to overcome this lack of efficiency (in both the total 
number of differentially expressed genes recovered and the percentage that are true 
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positives), it is still not clear if such adaptations are practically effective — proving 
efficiency by spiking with a known amount of limited numbers of artificial 
construct(s) is one thing, but isolating a high percentage of the rare messages already 
present in an mRNA population is another. Of course, some models will genuinely 
produce only a small number of differentially expressed genes. In addition, there are 
also technical problems that can reduce efficiency. For example, mRNAs may have 
an unusual primary structure that effectively prevents their amplification by PCR- 
based systems. In addition, it is known that under certain circumstances not all 
mRNAs have 3 'poly A sites. For example, during Xenopus development, deadenyl- 
ation is used as a means to stabilize RNAs (Voeltz and Steitz 1998), whilst 
preferential deadenylation may play a role in regulating Hsp70 (and perhaps, 
therefore, other stress protein) expression in Drosophila (Dellavalle et aL 1994). The 
presence of deadenylated mRNAs would clearly reduce the efficiency of systems 
utilizing a polydT reverse transcription step. The efficiency of any system also 
depends on the quality of the starting material. All differential display techniques 
use mRNA as their target material. However, it is difficult to isolate mRNA that is 
completely free of ribosomal RNA. Even if polydT primers are used to prime first 
strand cDNA synthesis, ribosomal RNA is often transcribed to some degree 
(Clontech PCR-Select cDNA Subtraction kit user manual). It has been shown, at 
least in the case of SSH, that a high rRNA:mRNA ratio can lead to inefficient 
subtractive hybridization (Clontech PCR-Select cDNA Subtraction kit user 
manual), and there is no reason to suppose that it will not do likewise in other SH 
approaches. Finally, those techniques that utilise a presubtraction amplification step 
(e.g. RDA) may present a skewed representation since some sequences amplify 
better than others. 

Of course, probably the most important consideration is the temporal factor. It 
is clear that any given differential display experiment can only interrogate a cell at 
one point in time. It may well be that a high percentage of the genes showing altered 
expression at that time are obtained. However, given that disease processes and 
responses to environmental stimuli involve dynamic cascades of signalling, 
regulation, production and action, it is clear that all those genes which are switched 
on/off at different times will not be recovered and, therefore, vital information may 
well be missed. It is, therefore, imperative to obtain as much information about the 
model system beforehand as possible, from which a strategy can be derived for 
targeting specific time points or events that are of particular interest to the 
investigator. One way of getting round this problem of single time point analysis is 
to conduct the experiment over a suitable time course which, of course, adds 
substantially to the amount of work involved. 



How sensitive are differential expression technologies? 

There has been little published data that addresses the issue of how large the 
change in expression must be for it to permit isolation of the gene in question with 
the various differential expression technologies. Although the isolation of genes 
whose expression is changed as little as 1.5-fold has been reported using SSH 
(Groenink and Leegwater 1996), it appears that those demonstrating a change in 
excess of 5-fold are more likely to be picked up. Thus, there is a 'grey zone' 
in between where small changes could fade in and out of isolation between 
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experiments and animals. DD, on the other hand, is not subject to this grey 
zone since, unlike SH approaches, it does not amplify the difference in expression 
between two samples. Wan et aL (1996) reported that differences in expression of 
twofold or more are detectable using DD. 

Resolution and visualization of differential expression products 

It seems highly improbable with current technology that a gel system could be 
developed that is able to resolve all gene species showing altered expression in any 
given test system (be it SH- or DD-based). Polyacrylamide gel electrophoresis 
(PAGE) can resolve size differences down to 0.2% (Sambrook et aL 1989) and are 
used as standard in DD experiments. Even so, it is clear that a complex series of gene 
products such as those seen in a DD will contain unresolvable components. Thus, 
what appears to be one band in a gel may in fact turn out to be several. Indeed, it has 
been well documented (Mathieu-Daude et al. 1996, Smith et aL 1997) that a single 
band extracted from a DD often represents a composite of heterogeneous products, 
and the same has been found for SSH displays in this laboratory (Rockett et aL 
1997). One possible solution was offered by Mathieu-Daude et aL (1996), who 
extracted and reamplified candidate bands from a DD display and used single strand 
conformation polymorphism (SSCP) analysis to confirm which components 
represented the truly differentially expressed product. 

Many scientists often try to avoid the use of PAGE where possible because it is 
technically more demanding than agarose gel electrophoresis (AGE). Unfortunately, 
high resolution agarose gels such as Metaphor (FMC, Lichfield, UK) and AquaPor 
HR (National Diagnostics, Hessle, UK), whilst easier to prepare and manipulate 
than PAGE, can only separate DNA sequences which differ in size by around 
1.5-2% (15-20 base pairs for a 1Kb fragment). Thus, SSH, RDA or other such 
products which differ in size by less than this amount are normally not resolvable. 
However, a simple technique does in fact exist for increasing the resolving power of 
AGE— the inclusion of HA-red (10-phenyl neutral red-PEG ligand) or HA-yellow 
(bisbenzamide-PEG ligand) (Hanse Analytik GmbH, Bremen, Germany) in a 
gel separates identical or closely sized products on base content. Specifically, 
HA-red and -yellow selectively bind to GC and AT DNA motifs, respectively 
(Wawer et aL 1995, Hanse Analytik 1997, personal communication). Since both 
HA-stains possess an overall positive charge, they migrate towards the cathode 
when an electric field is applied. This is in direct opposition to DNA, which 
is negatively charged and, therefore, migrates towards the anode. Thus, if two 
DNA clones are identical in size (as perceived on a standard high resolution 
agarose gel), but differ in AT/GC content, inclusion of a HA-dye in the gel 
will effectively retard the migration of one of the sequences compared to the 
other, effectively making it apparently larger and, thus, providing a means of 
differentiating between the two. The use of HA-red has been shown to resolve 
sequences with an AT variation of less than 1% (Wawer et aL 1995), whilst Hanse 
Analytik have reported that HA staining is so sensitive that in one case it was used 
to distinguish two 567bp sequences which differed by only a single point mutation 
(Hanse Analytik 1996, personal communication). Therefore, if one wishes to check 
whether all the clones produced from a specific band in a differential display 
experiment are derived from the same gene species, a small amount of reamplified 
or digested clone can be run on a standard high resolution gel, and a second aliquot 
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Figure 10. Discrimination of clones of identical/nearly identical size using HA-red. Bands of decreasing 
size (1-5) were extracted from the final display of a suppression sub tractive hybridization 
experiment and cloned. Seven colonies were picked at random from each cloned band and their 
inserts amplified using PCR. The products were run on two gels, (A) a high resolution 2% agarose 
gel, and (B) a high resolution 2% agarose gel containing 1 U/ml HA-red. With few exceptions, all 
the clones from each band appear to be the same size (gel A). However, the presence of HA-red 
(gel B), which separates identically-sized DNA fragments based on the percentage of GC within 
the sequence, clearly indicates the presence of different gene species within each band. For 
example, even though all five re-amplified clones of band 1 appear to be the same size, at least four 
different gene species are represented. 



in a similar gel containing one of the HA-stains. The standard gel should indicate 
any gross size differences, whilst the HA-stained gel should separate otherwise 
unresolvable species (on standard AGE) according to their base content. Geisinger 
et al. (1997) reported successful use of this approach for identifying DD-derived 
clones. Figure 10 shows such an experiment carried out in this laboratory on clones 
obtained from a band extracted from an SSH display. 

An alternative approach is to carry out a 2-D analysis of the differential display 
products. In this approach, size-based separation is first carried out in a standard 
agarose gel. The gel slice containing the display is then extracted and incorporated 
in to a HA gel for resolution based on AT/GC content. 

Of course, one should always consider the possibility of there being different 
gene species which are the same size and have the same GC/AT content. However, 
even these species are not unresolvable given some effort — again, one might use 
SSCP, or perhaps a denaturing gradient gel electrophoresis (DGGE) or temperature 
gradient field electrophoresis (TGGE) approach to resolve the contents of a band, 
either directly on the extracted band (Suzuki et al. 1991) or on the reamplified 
product. 

The requirement of some differential display techniques to visualize large 
numbers of products (e.g. DD and GEF) can also present a problem in that, in terms 
of numbers, the resolution of PAGE rarely exceeds 300-400 bands. One approach to 
overcoming this might be to use 2-D gels such as those described by Uitterlinden et 
al. (1989) and Hatada et aL (1991). 
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Extraction of differentially expressed bands from a gel can be complex since, in 
some cases (e.g. DD, GEF), the results are visualized by autoradiographic means, 
such that precise overlay of the developed film on the gel must occur if the correct 
band is to be extracted for further analysis. Clearly, a misjudged extraction can 
account for many man-hours lost. This problem , and that of the use of radioisotopes , 
has been addressed by several groups. For example, Lohmann et al. (1995) 
demonstrated that silver staining can be used directly to visualize DD bands in 
horizontal PAGs. An et al. (1996) avoided the use of radioisotopes by transferring a 
small amount (20-30%) of the DNA from their DD to a nylon membrane, and 
visualizing the bands using chemiluminescent staining before going back to extract 
the remaining DNA from the gel. Chen and Peck (1996) went one step further and 
transferred the entire DD to a nylon membrane. The DNA bands were then 
visualized using a digoxigenin (DIG) system (DIG was attached to the polydT 
primers used in the differential display procedure). Differentially expressed bands 
were cut from the membrane and the DNA eluted by washing with PCR buffer prior 
to reamplification. 

One of the advantages of using techniques such as SSH and RD A is that the final 
display can be run on an agarose gel and the bands visualized with simple ethidium 
bromide staining. Whilst this approach can provide acceptable results, overstaining 
with S YBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances 
the intensity and sharpness of the bands. This greatly aids in their precise extraction 
and often reveals some faint products that may otherwise be overlooked. Whilst 
differential displays stained with SYBR Green I are better visualized using short 
wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter 
wavelength is much more DNA damaging. In practice, it takes only a few seconds 
to damage DNA extracted under 254 nm irradiation, effectively preventing 
reamplification and cloning. The best approach is to overstain with SYBR Green I 
and extract bands under a medium wavelength UV transillumination. 

The possible use of 'microfingerprinting * to reduce complexity 

Given the sheer number of gene products and the possible complexity of each 
band, an alternative approach to rapid characterization may be to use an enhanced 
analysis of a small section of a differential display — a * sub-fingerprint * or 'micro- 
fingerprint\ In this case, one could concentrate on those bands which only appear 
in a particular chosen size region. Reducing the fingerprint in this way has at least 
two advantages. One is that it should be possible to use different gel types, 
concentrations and run times tailored exactly to that region. Currently, one might 
run products from 100-3000 +bp on the same gel, which leads to compromize in the 
gel system being used and consequently to suboptimal resolution, both in terms of 
size and numbers, and can lead to problems in the accurate excision of individual 
bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis 
using a HA-stain, as described earlier. In summary, if a range of gene product sizes 
is carefully chosen to included certain * relevant* genes, the 2-D system standardized, 
and appropriate gene analysis used, it may be possible to develop a method for the 
early and rapid identification of compounds which have similar or widely different 
cellular effects. If the prognosis for exposure to one or more other chemicals which 
display a similar profile is already known, then one could perhaps predict similar 
effects for any new compounds which show a similar micro-fingerprint. 
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An alternative approach to microfingerprinting is to examine altered expression 
in specific families of genes through careful selection of PCR primers and /or post- 
reaction analysis. Stress genes, growth factors and /or their receptors, cell cycling 
genes, cytochromes P450 and regulatory proteins might be considered as candidates 
for analysis in this way. Indeed, some off-the-shelf DNA arrays (e.g. Clontech's 
Atlas cDNA Expression Array series) already anticipated this to some degree by 
grouping together genes involved in different responses e.g. apoptosis, stress, DNA- 
damage response etc. 



Screening 

False positives 

The generation of false positives has been discussed at length amongst the 
differential display community (Liang et al. 1993, 1995, NishioeZa/. 1994, Sun et al. 
1994, Sompayrac et al. 1995). The reason for false positives varies with the 
technique being used. For instance, in RDA, the use of adaptors which have not 
been HPLC purified can lead to the production of false positives through illegitimate 
ligation events (O'Neill and Sinclair 1997), whilst in DD they can arise through 
PCR artifacts and illegitimate transcription of rRNA. In SH, false positives appear 
to be derived largely from abundant gene species, although some may arise from 
cDNA/mRNA species which do not undergo hybridization for technical reasons. 

A quick screening of putative differentially expressed clones can be carried out 
using a simple dot blot approach, in which labelled first strand probes synthesized 
from tester and driver mRN A are hybridized to an array of said clones (Hedrick et 
al. 1984, Sakaguchi et al. 1986). Differentially expressed clones will hybridize to 
tester probe, but not driver. The disadvantage of this approach is that rare species 
may not generate detectable hybridization signals. One option for those using SSH 
is to screen the clones using a labelled probe generated from the subtracted cDNA 
from which it was derived, and with a probe made from the reverse subtraction 
reaction (ClonTechniques 1997a). Since the SSH method enriches rare sequences, 
it should be possible to confirm the presence of clones representing low abundance 
genes. Despite this quick screening step, there is still the need to go back to the 
original mRNA and confirm the altered expression using a more quantitative 
approach. Although this may be achieved using Northern blots, the sensitivity is 
poor by today's high standards and one must rely on PCR methods for accurate and 
sensitive determinations (see below). 



Sequence analysis 

The majority of differential display procedures produce final products which are 
between 100 and lOOObp in size. However, this may considerably reduce the size of 
the sequence for analysis of the DNA databases. This in turn leads to a reduced 
confidence in the result — several families of genes have members whose DNA 
sequences are almost identical except in a few key stretches, e.g. the cytochrome 
P450 gene superfamily (Nelson et al. 1996). Thus, does the clone identified as being 
almost identical to gene X 0 really come from that gene, or its brother gene X, or its 
as yet undiscovered sister X 2 ? For example, using SSH, part of a gene was isolated, 
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which was up-regulated in the liver of rats exposed to Wy-14,643 and was identified 
by a FASTA search as being transferrin (data not shown). However, transferrin is 
known to be downregulated by hypolipidemic peroxisome proliferators such as Wy- 
14,643 (Hertz et al. 1996), and this was confirmed with subsequent RT-PCR 
analysis. This suggests that the gene sequence isolated may belong to a gene which 
is closely related to transferrin, but is regulated by a different mechanism. 

A further problem associated with SH technology is redundancy. In most cases 
before SH is carried out, the cDN A population must first be simplified by restriction 
digestion. This is important for at least two reasons: 

(1) To reduce complexity — long cDNA fragments may form complex networks 
which prevent the formation of appropriate hybrids, especially at the high 
concentrations required for efficient hybridization. 

(2) Cutting the cDNAs into small fragments provides better representation of 
individual genes. This is because genes derived from related but distinct 
members of gene families often have similar coding sequences that may cross- 
hybridize and be eliminated during the subtraction procedure (Ko 1990). 
Furthermore, different fragments from the same cDN A may differ considerably 
in terms of hybridization and amplification and, thus, may not efficiently do one 
or the other (Wang and Brown 1991). Thus, some fragments from differentially 
expressed cDNAs may be eliminated during subtractive hybridization pro- 
cedures. However, other fragments may be enriched and isolated. As a 
consequence of this, some genes will be cut one ur mure limes, giving rise to two 
or more fragments of different sizes. If those same genes are differentially 
expressed, then two or more of the different size fragments may come through 
as separate bands on the final differential display, increasing the observed 
redundancy and increasing the number of redundant sequencing reactions. 

Sequence comparisons also throw up another important point — at what degree 
of sequence similarity does one accept a result. Is 90% identitiy between a gene 
derived from your model species and another acceptably close? Is 95% between 
your sequence and one from the same species also acceptable? This problem is 
particularly relevant when the forward and reverse sequence comparisons give 
similar sequences with completely different gene species! An arbitrary decision 
seems to be to allocate genes that are definite (95% and above similarity) and then 
group those between 60 and 95% as being related or possible homologues. 

Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the 
candidate genes, either as a means of confirming that they are truly differentially 
expressed, or in order to establish just what the differences are. Northern blot 
analysis is a popular approach as it is relatively easy and quick to perform. However, 
the major drawback with Northern blots is that they are often not sensitive enough 
to detect rare sequences. Since the majority of messages expressed in a cell are of low 
abundance (see table 1), this is a major problem. Consequently, RT-PCR may be the 
method of choice for confirming differential expression. Although the procedure is 
somewhat more complex than Northern analysis, requiring synthesis of primers and 
optimization of reaction conditions for each gene species, it is now possible to set up 
high throughput PCR systems using mulitchannel pipettes, 96 +-well plates and 
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appropriate thermal cycling technology. Whilst quantitative analysis is more 
desirable, being more accurate and without reliance on an internal standard, the 
money and time needed to develop a competitor molecule is often excessive, 
especially when one might be examining tens or even hundreds of gene species. The 
use of semi-quantitative analysis is simpler, although still relatively involved. One 
must first of all choose an internal standard that does not change in the test cells 
compared to the controls. Numerous reference genes have been tried in the past, for 
example interferon-gamma (IFN-p, Frye et al. 1989), /J-actin (Heuval et al. 1994), 
glyceraldehyde-3-phosphate dehydrogenase (GAPDH, Wong et al 1994), di- 
hydrofolate reductase (DHFR, Mohler and Butler 1991), /^-microglobulin (£-2- 
m, Murphy et al 1990), hypoxanthine phosphoribosyl transferase (HPRT, Foss et 
al. 1998) and a number of others (ClonTechniques 1997b). Ideally, an internal 
standard should not change its level of expression in the cell regardless of cell age, 
stage in the cell cycle or through the effects of external stimuli. However, it has been 
shown on numerous occasions that the levels of most housekeeping genes currently 
used by the research community do in fact change under certain conditions and in 
different tissues (ClonTechniques 1997b). It is imperative, therefore, that pre- 
liminary experiments be carried out on a panel of housekeeping genes to establish 
their suitability for use in the model system. 

Interpretation of quantitative data must also be treated with caution. By 
comparing the lists of genes identified by differential expression one can perhaps 
gain insight into why two different species react in different ways to external stimuli. 
For example, rats and mice appear sensitive to the non-genotoxic effects of a wide 
range of peroxisome proliferators whilst Syrian hamsters and guinea pigs are largely 
resistant (Orton et al. 1984, Rodricks and Turnbull 1987, Lake et al. 1989, 1993, 
Makowska et al. 1992). A simplified approach to resolving the reason(s) why is to 
compare lists of up- and down-regulated genes in order to identify those which are 
expressed in only one species and, through background knowledge of the effects of 
the said gene, might suggest a mechanism of facilitated non-genotoxic carcinogenesis 
or protection. Of course, the situation is likely to be far more complex. Perhaps if 
there were one key gene protecting guinea pig from non-genotoxic effects and it was 
upregulated 50 times by PPs, the same gene might only be up-regulated five times 
in the rat. However, since both were noted to be upregulated, the importance of the 
gene may be overlooked. Just to complicate matters, a large change in expression 
does not necessarily mean a biologically important change. For example, what is the 
true relevance of gene Y which shows a 50-fold increase after a particular treatment, 
and gene Z which shows only a 5-fold increase? If one examines the literature one 
may find that historically, gene Y has often been shown to be up-regulated 40-60- 
fold by a number of unrelated stimuli — in light of this the 50-fold increase would 
appear less significant. However, the literature may show that gene Z has never been 
recorded as having more than doubled in expression — which makes your 5 -fold 
increase all the more exciting. Perhaps even more interesting is if that same 5-fold 
increase has only been seen in related neoplasms or following treatment with related 
chemicals. 

Problems in using the differential display approach 

Differential display technology originally held promise of an easily obtainable 
' fingerprint * of those genes which are up- or down-regulated in test animals /cells in 
a developmental process or following exposure to given stimuli. However, it has 
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become clear that the fingerprinting process, whilst still valid, is much too complex 
to be represented by a single technique profile. This is because all differential display 
techniques have common and/or unique technical problems which preclude the 
isolation and identification of all those genes which show changes in expression. 
Furthermore, there are important genetic changes related to disease development 
which differential expression analysis is simply not designed to address. An example 
of this is the presence of small deletions, insertions, or point mutations such as those 
seen in activated oncogenes, tumour suppressor genes and individual poly- 
morphisms. Polymorphic variations, small though they usually are, are often 
regarded as being of paramount importance in explaining why some patients 
respond better than others to certain drug treatments (and, in logical extension, why 
some people are less affected by potentially dangerous xenobiotics /carcinogens than 
others). The identification of such point mutations and naturally occurring 
polymorphisms requires the subsequent application of sequencing, SSCP, DGGE 
or TGGE to the gene of interest. Furthermore, differential display is not designed 
to address issues such as alternatively spliced gene species or whether an increased 
abundance of mRNA is a result of increased transcription or increased mRNA 
stability. 



Conclusions 

Perhaps the main advantage of open system differential display techniques is that 
they are not limited by extant theories or researcher bias in revealing genes which are 
differentially expressed, since they are designed to amplify all genes which 
demonstrate altered expression. This means that they are useful for the isolation of 
previously unknown genes which may turn out be useful biomarkers of a particular 
state or condition. At least one open system (SAGE) is also quantitative, thus 
eliminating the need to return to the original mRNA and carry out Northern /PCR 
analysis to confirm the result. However, the rapid progress of genome mapping 
projects means that over the next 5-10 years or so, the balance of experimental use 
will switch from open to closed differential display systems, particularly DNA 
arrays. Arrays are easier and faster to prepare and use, provide quantitative data, are 
suitable for high throughput analysis and can be tailored to look at specific signalling 
pathways or families of genes. Identification of all the gene sequences in human and 
common laboratory animals combined with improved DNA array technology, 
means that it will soon no longer be necessary to try to isolate differentially expressed 
genes using the technically more demanding open system approach. Thus, their 
main advantage (that of identifying unknown genes) will be largely eradicated. It is 
likely, therefore, that their sphere of application will be reduced to analysis of the 
less common laboratory species, since it will be some time yet before the genomes of 
such animals as zebrafish, electric eels, gerbils, crayfish and squid, for example, will 
be sequenced. 

Of course, in the end the question will always remain: What is the functional/ 
biological significance of the identified, differentially expressed genes? One 
persistent problem is understanding whether differentially expressed genes are a 
cause or consequence of the altered state. Furthermore, many chemicals, such as 
non-genotoxic carcinogens, are also mitogens and so genes associated with 
replication will also be upregulated but may have little or nothing to do with the 
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carcinogenic effect. Whilst differential display technology cannot hope to answer 
these questions, it does provide a springboard from which identification, regulatory 
and functional studies can be launched. Understanding the molecular mechanism of 
cellular responses is almost impossible without knowing the regulation and function 
of those genes and their condition (e.g. mutated). In an abstract sense, differential 
display can be likened to a still photograph, showing details of a fixed moment in 
time. Consider the Historian who knows the outcome of a battle and the placement 
and condition of the troops before the battle commenced, but is asked to try and 
deduce how the battle progressed and why it ended as it did from a few still 
photographs — an impossible task. In order to understand the battle, the Historian 
must find out the capabilities and motivation of the soldiers and their commanding 
officers, what the orders were and whether they were obeyed. He must examine the 
terrain, the remains of the battle and consider the effects the prevailing weather 
conditions exerted. Likewise, if mechanistic answers are to be forthcoming, the 
scientist must use differential display in combination with other techniques, such as 
knockout technology, the analysis of cell signalling pathways, mutation analysis and 
time and dose response analyses. Although this review has emphasized the 
importance of differential gene profiling, it should not be considered in isolation and 
the full impact of this approach will be strengthened if used in combination with 
functional genomics and proteomics (2-dimensional protein gels from isoelectric 
focusing and subsequent SDS electrophoresis and virtual 2D -maps using capillary 
electrophoresis). Proteomics is attracting much recent attention as many of the 
changes resulting in differential gene expression do not involve changes in mRNA 
levels, as decribed extensively herein, but rather protein-protein, protein-DNA and 
protein phosphorylation events which would require functional genomics or 
proteomic technologies for investigation. 

Despite the limitations of differential display technology, it is clear that many 
potential applications and benefits can be obtained from characterizing the genetic 
changes that occur in a cell during normal and disease development and in response 
to chemical or biological insult. In light of functional data, such profiling will 
provide a * fingerprint' of each stage of development or response, and in the long 
term should help in the elucidation of specific and sensitive biomarkers for different 
types of chemical /biological exposure and disease states. The potential medical and 
therapeutic benefits of understanding such molecular changes are almost im- 
measurable. Amongst other things, such fingerprints could indicate the family or 
even specific type of chemical an individual has been exposed to plus the length 
and/or acuteness of that exposure, thus indicating the most prudent treatment. 
They may also help uncover differences in histologically identical cancers, provide 
diagnostic tests for the earliest stages of neoplasia and, again, perhaps indicate the 
most efficacious treatment. 

The Human Genome Project will be completed early in the next century and the 
DNA sequence of all the human genes will be known. The continuing development 
and evolution of differential gene expression technology will ensure that this 
knowledge contributes fully to the understanding of human disease processes. 
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ABSTRACT The recent ability to sequence whole genomes 
allows ready access to all genetic material. The approaches 
outlined here allow automated analysis of sequence for the 
synthesis of optimal primers in an automated multiplex 
oligonucleotide synthesizer (AMOS). The efficiency is such 
that all ORFs for an organism can be amplified by PCR. The 
resulting amplicons can be used directly in the construction of 
DNA arrays or can be cloned for a large variety of functional 
analyses. These tools allow a replacement of single-gene 
analysis with a highly efficient whole-genome analysis. 



The genome sequencing projects have generated and will 
continue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Escherichia coli, Hae- 
mophilus influenzae (1), Mycoplasma genitalium (2), and Meth- 
siys\yQ**s*itc i/i«M/ir s*i* it hsvs b^e* 1 r> or r, p^ct <a ^y sequ^ic-d 
Other model organisms have had substantial portions of their 
genomes sequenced as well, including the nematode Caeno- 
rhabditis elegans (4) and the small flowering plant Arabidopsis 
thaliana (5). This massive and increasing amount of sequence 
information allows the development of novel experimental 
approaches to identify gene function. 

One standard use of genome sequence data is to attempt to 
identify the functions of predicted open reading frames 
(ORFs) within the genome by comparison to genes of known 
function. Such a comparative analysis of all ORFs to existing 
sequence data is fast, simple, and requires no experimentation 
and is therefore a reasonable first step. While finding sequence 
homologies/motifs is not a substitute for experimentation, 
noting the presence of sequence homology and/or sequence 
motifs can be a useful first step in finding interesting genes, in 
designing experiments and, in some cases, predicting function. 
However, this type of analysis is frequently uninformative. For 
example, over one-half of new ORFs in S. cerevisiae have no 
known function (6). If this is the case in a well studied organism 
such as yeast, the problem will be even worse in organisms that 
are less well studied or less manipulable. A large, experiment 
tally determined gene function database would make homol- 
ogy/motif searches much more useful. 

Experimental analysis must be performed to thoroughly 
understand the biological function of a gene product. Scaling 
up from classical "cottage industry" one-gene-oriented ap- 
proaches to whole-genome analysis would be very expensive 
and laborious. It is clear that novel strategies are necessary to 
efficiently pursue the next phase of the genome projects — 
whole-genome experimental analysis to explore gene expres- 
sion, gene product function, and other genome functions. 
Model organisms, such as S. cerevisiae, will be extremely 



The publication costs of this article were defrayed in part by page charge 
payment. This article must therefore be hereby marked "advertisement" in 
accordance with 18 U.S.C §1734 solely to indicate this fact. 

© 1997 by The National Academy of Sciences 0027-8424/97/948945-3S2.00/0 
PNAS is available online at http://www.pnas.org. 



important in the development of novel whole-genome analysis 
techniques and, subsequently, in improving our understanding 
of other more complex and less manipulable organisms. 

The genome sequence can be systematically used as a tool 
to understand ORFs, gene product function, and other ge- 
nome regions. Toward this end, a directed strategy has been 
developed for exploiting sequence information as a means of 
providing information about biological function (Fig. 1). Ef- 
forts have been directed toward the amplification of each 
predicted ORF or any other region of the genome ranging 
from a few base pairs to several kilobase pairs. There are many 
uses for these amplicons — they can be cloned into standard 
vectors or specialized expression vectors, or can be cloned into 
other specialized vectors such as those used for two-hybrid 
analysis. The amplicons can also be used directly by, for 
example, arraying onto glass for expression analysis, for DNA 
binding assays, or for any direct DNA assay (7). As a pilot 
study, synthetic primers were made on the 96-well automated 
multiplex oligonucleotide synthesizer (AMOS) instrument (8) 
(Fig. 2). These oligonucleotides were used to amplify each 
ORF on yeast chromosome V. The current version of this 
instrument can synthesize three plates of 96 oligonucleotides 
each (25 bases) in an 8-hr day. The amplification of the entire 
set of PCR products was then analyzed by gel electrophoresis 
(Fig. 3). Successful amplification of the proper length product 
on the first attempt was 95%. This project demonstrates that 
one can go directly from sequence information to biological 
analysis in a truly automated, totally directed manner. 

These amplicons can be incorporated directly in arrays or 
the amplicons can be cloned. If the amplicons are to be cloned, 
novel sequences can be incorporated at the 5' end of the 
oligonucleotide to facilitate cloning. One potential problem 
with cloning PCR products is that the cloned amplicons may 
contain sequence alterations that diminish their utility. One 
option would be to resequence each individual amplicon. 
However, this is expensive, inefficient, and time consuming. A 
faster, more cost-effective, and more accurate approach is to 
apply comparative sequencing by denaturing HPLC (9). This 
method is capable of detecting a single base change in a 2-kb 
heteroduplex. Longer amplicons can be analyzed by use of 
appropriate restriction fragments. If any change is detected in 
a clone, an alternate clone of the same region can be analyzed. 
Modifying the system to allow high throughput analysis by 
denaturing HPLC is also relatively simple and straightforward. 

If amplicons are used directly on arrays without cloning, it 
is important to note that, even if single PCR product bands are 
observed on gels, the PCR products will be contaminated with 
various amounts of other sequences. This contamination has 
the potential to affect the results in, for example, expression 
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Fig. 1. Overview of systematic method for isolating individual 
genes. Sequence information is obtained automatically from sequence 
databases. The data are input into primer selection software specifi- 
cally designed to target ORFs as designated by database annotations. 
The output file containing the primer information is directly read by 
a high-throughput oligonucleotide synthesizer, which makes the oli- 
gonucleotides in 96-well plates (AMOS, automated multiplex oligo- 
nucleotide synthesizer). The forward and reverse primers are synthe- 
sized in the same location on separate plates to facilitate the down- 
stream handling of primers. The amplicons are generated by PCR in 
96-well plates as well. 

analysis. On the other hand, direct use of the amplicons is 
much less labor intensive and greatly decreases the occurrence 
of mistakes in clone identification, a ubiquitous problem 
associated with large clone set archiving and retrieving. 

Any large-scale effort to capture each ORF within a genome 
must rely on automation if cost is to be minimized while 
efficiency is maximized. Toward that end, primers targeting 
ORFs were designed automatically using simple new scripts 
and existing primer selection software. These script-selected 
primer sequences were directly read by the high-throughput 
synthesizer and the forward and reverse primers were synthe- 
sized in separate plates in corresponding wells to facilitate 
automated pipetting and PCR amplifications. Each of the 
resulting PCR products, generated with minimum labor, con- 
tains a known, unique ORF. 

Large-scale genome analysis projects are dependent on 
newly emerging technologies to make the studies practical and 
economically feasible. For example, the cost of the primers, a 
significant issue in the past, has been reduced dramatically to 
make feasible this and other projects that require tens of 
thousands of oligonucleotides. Other methods of high- 
throughput analysis are also vital to the success of functional 
analysis projects, such as microarraying and oligonucleotide 
chip methods (10-14). 

Changes in attitude are also required. One of the major costs 
of commercial oligonucleotides is extensive quality control 
such that virtually 100% of the supplied oligonucleotides are 
successfully synthesized and work for their intended purpose. 
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Fig. 2. Overall approach for using database of a genome to direct 
biological analysis. The synthesis of the 6,000 ORFs (orfs) for each 
gene of S. cerevisiae can be used in many applications utilizing both 
cloning and microarraying technology. 

Considerable cost reduction can be obtained by simply de- 
creasing the expected successful synthesis rate to 95-97%. One 
can then achieve faster and cheaper whole genome coverage by 
simply adding a single quality control at the end of the 
experiment and batching the failures for resynthesis. 

The directed nature of the amplicon approach is of clear 
advantage. The sequence of each ORF is analyzed automati- 
cally, and unique specific primers are made to target each 
ORF. Thus, there is relatively little time or labor involved— for 
example, no random cloning and subsequent screening is 
required because each product is known. In the test system, 
primers for 240 ORFs from chromosome V were systematically 
synthesized, beginning from the left arm and continuing 
through to the right arm. At no point was there any manual 
analysis of sequence information to generate the collection. In 
many ways, now that the sequence is known, there is no need 
for the researcher to examine it. 

These amplicons can be arrayed and expression analysis can 
be done on all arrayed ORFs with a single hybridization (10). 
Those ORFs that display significant differential expression 
patterns under a given selection are easily identified without 
the laborious task of searching for and then sequencing a clone. 
Once scaled up, the procedure provides even greater returns 
on effort, because a single hybridization will ultimately provide 
a "snapshot" of the expression of all genes in the yeast genome. 
Thus, the limiting factor in whole genome analysis will not be 
the analysis process itself, but will instead be the ability of 
researchers to design and carry out experimental selections. 

Current expression and genetic analysis technologies are 
geared toward the analysis of single genes and are ill suited to 
analyze numerous genes under many conditions. Additional 
difficulties with current technologies include: the effort and 
expense required to analyze expression and make mutants, the 
potential duplication of effort if done by different laboratories, 
and the possibility of conflicting results obtained from differ- 
ent laboratories. In contrast, whole genome analysis not only 
is more efficient, it also provides data of much higher quality; 
all genes are assayed and compared in parallel under exactly 
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Fig. 3. Gel image of amplifications. Using the method described in Fig. 1, amplicons were generated for ORFs of S. cerevisiae chromosome 
V. One plate of 96 amplification reactions is shown. 



the same conditions. In addition, amplicons have many appli- 
cations beyond gene expression. For example, one recent 
approach is to incorporate a unique DNA sequence tag, 
synthesized as part of each gene specific primer, during 
amplification. The tags or molecular bar codes, when reintro- 
duced into the organism as a gene deletion or as a gene clone, 
can be used much more efficiently than individual mutations 
or clones because pools of tagged mutants or transformants 
can be analyzed in parallel. This parallel analysis is possible 
because the tags are readily and quantitatively amplified even 
in complex mixtures of tags (13). 

These ORF genome arrays and oligonucleotide tagged 
libraries can be used for many applications. Any conventional 
selection applied to a library that gives discrete or multiple 
products can use these technologies for a simple direct read- 
out. These include screens and selections for mutant comple- 
mentation, overexpression suppression (15, 16), second-site 
suppressors, synthetic lethality, drug target overexpression 
(17), two-hybrid screens (18), genome mismatch scanning (19), 
or recombination mapping. 

The genome projects have provided researchers with a vast 
amount of information. These data must be used efficiently 
and systematically to gain a truly comprehensive understand- 
ing of gene function and, more broadly, of the entire genome 
which can then be applied to other organisms. Such global 
approaches are essential if we are to gain an understanding of 
the living cell. This understanding should come from the 
viewpoint of the integration of complex regulatory networks, 
the individual roles and interactions of thousands of functional 
gene products, and the effect of environmental changes on 
both gene regulatory networks and the roles of all gene 
products. The time has come to switch from the analysis of a 
single gene to the analysis of the whole genome. 

Support was provided by National Institutes of Health Grants 
R37H60198 and P01H600205. 
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Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cervisiae [4]. 
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, SI nu- 
clease analysis, plaque hybridization, and slot blots 
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 
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tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 

cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 
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[13,14]. Sample detection for microarrays on glass 
involves the use of probes labeled with fluores- 
cent or radioactive nucleotides. 

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran- 
scription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which 
produces control and test products labeled with dif- 
ferent fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass cov- 
erslip [10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal mi- 
croscope equipped with a motorized stage and lasers 
for fluor excitation [10,11,15]. The data are analyzed 
with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cervisiae [21]. 

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica- 
tion of this method to the fields of medical diagnos- 



tics, pharmacogenetics, and sequencing by hybrid- 
ization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high -intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32]. 

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has 
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human im- 
munodeficiency virus- 1 clade B protease gene [40] 
have been detected by this method. Oligonucleotide 
chips are also useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evalu- 
ation of gene-expression patterns in nearly all open 
reading frames of the yeast strain 5. cerevisiae [12]. 
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a 
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far more sensitive, characteristic, and measurable 
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a cus- 
tom DNA chip, called ToxChip vl.O, specifically for 
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all ex- 
periments the gene-expression data based on reia- 

Control 
Population 



tive fold induction or suppression of genes in treated 
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip vl.O. 

Treated 
Population 
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DNA "Chip" 
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Figure 1. Simplified overview of the method for sample trative purposes, samples derived from cell culture are depicted, 
preparation and hybridization to cDNA microarrays. For illus- although other sample types are amenable to this analysis. 
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Figure 2. Schematic representation of the method for iden- 
tification of a toxicant's mechanism of action. In this method, 
gene-expression data derived from exposure of model sys- 
tems to known toxicants are analyzed, and a set of changes 
characteristic to that type of toxicant (termed the toxicant 
signature) is identified. As depicted, oxidant stressors produce 



consistent changes in group A genes (indicated by red and 
green circles), but not group B or C genes (indicated by gray 
circles). The set of gene-expression changes elicited by the 
suspected toxicant is then compared with these characteristic 
patterns, and a putative mechanism of action is assigned to 
the unknown agent. 



The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators, 
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appre- 
ciable effects on the expression of these housekeep- 
ing genes. However, this housekeeping list will be 
revised if new data warrant the addition or deletion 
of a particular gene. Table 1 contains a general de- 
scription of some of the different classes of genes 
that comprise ToxChip vl.O. 

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures 



are displayed [11]. This facilitates rapid, visual in- 
terpretation of data. We are also developing Tox- 
Chip v2.0 and chips for other model systems, 
including rat, mouse, Xenopus, and yeast, for use in 
toxicology studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing. Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown 
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Table 1. ToxChip v1.0: A Human cDNA Microarray 
Chip Designed to Detect Responses to Toxic Insult 

No. of genes 



Gene category on chip 



Apoptosis 72 

DNA replication and repair 99 

Oxidative stress/redox homeostasis 90 

Peroxisome prol iterator responsive 22 

Dioxin/PAH responsive 12 

Estrogen responsive 63 

Housekeeping 84 

Oncogenes and tumor suppressor genes 76 

Cell-cycle control 51 

Transcription factors 1 3 1 

Kinases ^76 

Phosphatases 88 

Heat-shock proteins 23 

Receptors 349 

Cytochrome P450s ; 30 



*This list is intended as a general guide. The gene categories are not 
unique, and some genes are listed in multiple categories. 

agent is (or is not) responsible for eliciting a given 
biological response. This information would help to 
select a bioassay more specifically suited to the agent 
in question or perhaps suggest that a bioassay is not 
necessary, which would dramatically reduce cost, 
animal use, and time. 

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
flathead minnow, Daphnia, and Arabadopsis could 



also be improved by the addition of microarray analy- 
sis. The combination of microarrays with traditional 
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene- 
expression database are currently under way [44,45]. 
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression. 
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Alleles, Oligo Arrays, and Toxicogenetics 

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area 
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42], 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarray 
methods available will create problems of data com- 
patibility between platforms. In addition, the near- 
infinite variety of experimental conditions under 



which data will be collected by different laborato- 
ries will make large-scale data analysis extremely dif- 
ficult. To help circumvent these future problems, a 
set of standards to be included on all platforms 
should be established. These standards would facili- 
tate data entry into the national database and serve 
as reference points for cross-platform and inter-labo- 
ratory data analysis. 

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health. 
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The availability of genome-scale DNA sequence information and reagents has radically altered !ife=science 
research. This revolution has led to the development of a new scientific subdiscipline derived from a combina- 
tion of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the 
identification of potential human and environmental toxicants, and their putative mechanisms of action, through 
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of 
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene 
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for 
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray technol- 
ogy and to present our view of the usefulness of microarrays to the field of toxicology Mol. Carcinog. 24:153- 
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INTRODUCTION 

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cervisiae [4 J. 
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, SI nu- 
clease analysis, plaque hybridization, and slot blots 
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6), high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 



Almost without exception, gpne expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 

cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 



•Correspondence to: Laboratory of Molecular Carcinogenesis. 
National Institute of Environmental Health Sciences, 1 1 1 Alexander 
Drive, Research Triangle Park. NC 27709 
Received 8 December 199S; Accepted 5 January 1999 
Abbreviations' PAH. porycyclic aromatic hydrocarbon; NIEHS, Na- 
tional institute of Environmental Health Sciences. 



O 1999 WILEY-USS, INC 



NUWAYSIRETAL 



154 

[13,14]. Sample detection for microarrays on glass 
involves the use of probes labeled with fluores- 
cent or radioactive nucleotides. 

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran- 
scription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which 
produces control and test products labeled with dif- 
ferent fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass cov- 
erslip [10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal mi- 
croscope equipped with a motorized stage and lasers 
for fluor excitation [10,11,15]. The data are analyzed 
with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cervisiae [21]. 

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica- 
tion of this method to the fields of medical diagnos- 



tics, pharmacogenetics, and sequencing by hybrid- 
ization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32]. 

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has 
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human im- 
munodeficiency virus- 1 clade B protease gene [40] 
have been detected by this method. Oligonucleotide 
chips are alsu useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evalu- 
ation of gene-expression patterns in nearly all open 
reading frames of the yeast strain S. cerevisiae [12]. 
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology' uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a 
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far more sensitive, characteristic, and measurable 
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Ceiis are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a cus- 
tom DNA chip, called ToxChip vl.O, specifically for 
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all ex- 
periments the gene-pxpression data based on rcla- 

Control 
Population 



tive fold induction or suppression of genes in treated 
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 

For the studies outlined in Figure 2, we developed 
the curium cDNA microarray chip ToxChip vl.O. 
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Figure 1. Simplified overview of the method for sample trative purposes, samples derived from cell culture are depicted, 
preparation and hybridization to cDNA microarrays- For illus- although other sample types are amenable to this analysis. 
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Figure 2. Schematic representation of the method for iden- 
tification of a toxicant's mechanism of action. In this method, 
gene-expression data derived from exposure of model sys- 
tems to known toxicants are analyzed, and a set of changes 
characteristic to that type of toxicant (termed the toxicant 
signature) is identified. As depicted, oxidant stressors produce 



consistent changes in group A genes (indicated by red and 
green circles), but not group B or C genes (indicated by gray 
circles). The set of gene-expression changes elicited by the 
suspected toxicant is then compared with these characteristic 
patterns, arrd a putative mechanism of action is assigned to 
the unknown agent. 



The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators, 
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appre- 
ciable effects on the expression of these housekeep- 
ing genes. However, this housekeeping list will be 
revised if new data warrant the addition or deletion 
of a particular gene. Table 1 contains a general de- 
scription of some of the different classes of genes 
that comprise ToxChip vl.O. 

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures 



are displayed [11]. This facilitates rapid, visual in- 
terpretation of data. We are also developing Tox- 
Chip v2.0 and chips for other model systems, 
including rat, mouse, Xenopus, and yeast, for use in 
toxicology' studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxic:ty, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown 
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Gene category 



No. of genes 
on chip 



72 
99 
90 
22 
12 
63 
84 
76 
51 
131 
276 
88 
23 
349 
30 



Apoptosis 

DNA replication and repair 
Oxidative stress/redox homeostasis 
Peroxisome proliferator responsive 
Dioxin/PAH responsive 
Estrogen responsive 
Housekeeping 

Oncogenes and tumor suppressor genes 
Cell-cycle control 
Transcription factors 
Kinases 
Phosphatases 
Heat-shock proteins 
Receptors 
Cytochrome P450s 

•This list is intended as a general guide. The gene cateaories are not 
unique, and some genes are listed in multiple categories. 

agent is (or is not) responsible for eliciting a given 
biological response. This information would help to 
select a bioassay more specifically suited to the agent 
in question or perhaps suggest that a bioassav is not 
necessary, which would dramatically reduce cost, 
animal use, and time. 

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassav and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
flathead minnow, Daphnia, and Arabadopsis could 



also be improved by the addition of microarray analy- 
sis. The combination of microarrays with traditional 
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-cxprcs- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 

„ w ,„ 61w; y iC w5c iuirugates or saretv. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene- 
expression database are currently under way [44,45]. 
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression. 
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Alleles, Oligo Arrays, and Toxicogenetics 

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area 
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42]. 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarrav 
methods available will create problems of data com- 
patibility between platforms. In addition, the near- 
infinite variety of experimental conditions under 



which data will be collected by different laborato- 
ries will make large-scale data analysis extremely dif- 
ficult. To help circumvent these future problems, a 
set of standards to be included on all platforms 
should be established. These standards would facili- 
tate data entrv into the national database and serve 
as reference points for cross-platform and inter-labo- 
ratory data analysis. 

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and en%'ironmental 
health. 

ACKNOWLEDGMENTS 

The authors would like to thank Drs. Robert 
Maronpot, George Lucier, Scott Masten, Nigel 
Walker, Raymond Tennant, and Ms. Theodora 
Deverenux for critical review of this manuscript. EFN 
was supported in part bv NTEHS Training Grant 
#ES070l7-24. 

REFERENCES 

1 . http://www.ncbi.nlm nih.gov/Web/Genbank/jndex.himl 

2. http://www.ncbt.nim.nih.gov/Entre2/Genome/ora.html 

3 Fieischmann RD, Adams MD. White 0. et al. Whole-oenome ran- 
dom sequencina and assembly of Haemophilus influenzae Rd 
Science 1995;269:496-512. 

4. Goffeau A, Barrell BG. Bussey H. et al. Life with 6000 aenes 
Science 1996.274:546, 563-567. 

5. http://www.perkin-elmer.com/press/prc5448.html 

6. Liang P, Pardee AB. Differential display of eukaryotic messenger 
RNA by means of the polymerase chain reaction. Science 
1992;257:967-971. 

7. Pietu G, Alibert 0, Guichard V. et al. Novel gene transcripts pref- 
erentially expressed in human muscles revealed by quantitative 
hybridization of a high density cDNA array. Genome Res 
1996;6:492-503. 

8. Zhao ND. Hashtda H, Takahashi N. Misumi Y, Sakaki v -on-den- 
sity cDNA filter analysis — A novel approach for farge-sr^-e. quan- 
titative analysis of gene expression. Gene 1995;156:207-213. 

9. Velculescu VE. Zhang L. Vogelstein B. Kmzler KW. Serial analysis 
of gene expression. Science 1995;270:484-487. 

K xhena M, Shalon D. Davis RW, Brown PO. Quantitative moni- 
:ormg of gene-expression patterns with a complementary DNA 
microarray. Science 1995;270:467^J70. 

1 1 . DeRisi J, Peniand L. Brown PO. et al. use of a cDNA microarray to 
analyse qene expression patterns in human cancer. Nat Genet 
1996.14-457-460. 

12. Wodicka L. Dong HL. Mittmann M, Ho MH, Lockhart DJ. Ge- 
nome-wide expression monitoring m Saccharomyces cerevisiae. 
Nat Biotechnol 1997;15:1359-1367. 

13. Marshall A. Hodoson J DNA chips: An array of possibilities. Nat 
Biotechnol 1998;16:27-31. 

14. http://www,synteni.com 

15. Shalon D. Smith SJ. Brown PO. A DNA microarray system for 
analyzing complex DNA samples using two-color fluorescent 
probe hybridization. Genome Res 1996;6:639-645. 

16. Chen Y. Dougherty ER. Bittner ML. Ratio-based decisions and 
the Quantitative analysis of cDNA microarray images. Biomedical 
Optics l997;2.364-374. 

17. Khan J. Simon R, Bittner M. et al. Gene expression profiling of 
alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Res 
1998;58:5009-5013. 

18. Scnena M, Shalon D. Heller R. Chai A. Brown PO, Davis RW. 
Parallel human genome analysis: Microarray-based expression 
monitoring of 1000 genes. Proc Natl Acad Sc» USA 1996; 
93:10614-10619. 



MICROARRAYS AND TOXICOLOGY 



159 



19. Lashkari DA, DeRisi JL. McCusker JH, et at. Yeast microarrays for 
genome wide parallel genetic and gene expression analysis. Proc 
Natl Acad Sci USA 1997;94:13057-13062. 

20. Heller RA. Schena M, Chai A, et al. Discovery and analysis of 
inflammatory disease-related genes using cDNA microarrays. Proc 
Natl Acad Sci USA 1997;94:2 1 S0-2 1 55. 

21. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and ge- 
netic control of gene expression on a genomic scale. Science 
1997;278:680-686. 

22. Drmanac S. Stavropoulos NA, Labat I, et al. Gene-representing 
cDNA clusters defined by hybridization of 57,419 clones from 
infant brain libraries with short oligonucleotide probes. Genomics 
1996;37:29-40. 

23. Milosavljevic A, Savkovic S, Crkvenjakov R, et al. DNA seauence 
recognition by hybridization to short oligomers: ExDenmental 
verification of the method on the E. colt genome. Genomics 
1996;37:77-86. 

24. Drmanac S, Drmanac R. Processing of cDNA and genomic 
kilobase-size clones for massive screening, mapping and se- 
Quencmg by hybridization. Biotechniques 1934;17:328-323, 
332-336. 

25. http77www.resgen.com/ 

26. http.7Avww.genomesystems.com/ 

27. http://www.domech.com/ 

28. Pease AC, Solas DA, Fodor SPA. Parallel synthesis of soatiaiiy 
addressable oligonucleotide probe matrices' Abstract. Abstracts 
of Papers of the American Chemical Society 1992;203:34. 

29. Pease AC, Solas D. Sullivan EJ, Cronin MT, Holmes CP, Fodor 
SPA. Light-generated oligonucleotide arrays for rapid DNA 
sequence analysis. Proc Natl Acad Sci USA 1994;91 :5022- 
5026. 

30. Fodor SPA, Read JL, Pirrung MC. Stryer L, Lu AT, Soias D. ugnt- 
directed, spatially addressable parallel chemical symnesis. Sci- 
ence 1991;251:767-773. 

31. McGall G. Labadie J, Brock P, Wallraff G, Nguyen T, Hmsberg VV. 
Light-directed synthesis of high-aensity oligonucleotide arrays 

USinq Semiconductor DhOtOffKKtS Prnr Mat) A c = rj C c; nfA 

1996;93:13555-13560. 

32. bpshutz RJ, Morris D, Chee M, et al. Using oligonucleotide orobe 
arrays to access genetic diversity. Biotechniques 1 995; 1 9:442-447. 

33. Lockhart DJ, Dong HL, Byrne MC, et al. Expression monitoring 



by hybridization to hioh-density oligonucleotide arrays. Nat 
Biotechnol 1996;14:1675-1680. 

34. http://wvvw.mdyn.cam/ 

35. Sapolsky RJ, Upshutz RJ. Mapping genomic library clones using 
oligonucleotide arrays. Genomics 1996 ;3 3:445-4 56. 

36. Chee M. Yang R, HubbeH E, et al. Accessing genetic information 
with high-density DNA arrays. Science 1996;274:610-614. 

37. Hacia JG. Makalowski VV, Edgemon K, et al. Evolutionary se- 
quence comparisons using high-oensity olioonucleotide arrays. 
Nat Genet 1998;18:155-158. 

38. Cronin MT, Fucmi RV, Kim SM, Masino RS, Wespi RM, Miyada 
CG. Cystic fibrosis mutation detection by hybridization to hght- 
generated DNA probe arrays. Hum Mutat 1996;7:244-255. 

39. Hacia JG. Brody LC. Chee MS. Fodor SPA. Collins F5. Detection 
of heterozygous mutations in BRCA1 using high density oligo- 
nucleotide arrays and two-colour fluorescence analysis. Nat Genet 
1996; 14:441^147. 

40. Kozal MJ, Shah N. Shen NP, et al. Extensive polymorphisms ob- 
served in HlV-1 clade B protease oene usma hiah-oensiry oiiqo- 
nucieotide arrays. Nat Med 1996;2:753-759. ~ 

41 . Wang DG, Fan JB, Siao CJ, et al. Large-scale identification, map- 
ping, and genotypmg of smgle-nucleotide oolymorpnisms in the 
human genome. Science 1998;280:1077-1082. 

42. Winzeler EA, Richards DR. Conway AR. et al. Direct allelic varia- 
tion scanning of the yeast genome. Science 199S;2S 1 : 1 194-1 197. 

43. Chhabra RS~ Huff JE, Schwetz BS, Selkirk J. An overview of 
prechromc and chronic toxicity carcinogenicity expenmental-study 
designs and criteria used by the National Toxicology Proqram. 
Environ Health Perspect 1990;86:313-321. 

44. Ermolaeva 0. Rastogi M. Pruitt KD, et al. Data management and 
analysis for gene expression arrays. Nat Genet 199S~ 20: 19-23. 

45 . nttp://vwvw.nngr i.an .gov/Dln/LC G/ 1 5 K/HTML/doase. nim 

46. Samson M, Libert F, Doranz BJ, et al. Resistance to HlV-1 infec- 
tion m Caucasian individuals bearing mutant alleles of the CCR- 
5 chemokme receptor gene. Nature 1996.382.722-725 

47. Bell DA, Taylor JA, Paulson DF, Robertson CN. Mohler JL. Lucier 
GVV. Genetic risK ana carcinogen exposure — A common inher- 
ited defect of the carcinogen-metabolism gene glutathione-S- 
uansterase Ml (Gstml) that increases susceptibility to bladder 
cancer. J Natl Cancer Inst 1993;85:1 159-1 164. 

48. http://www.niehs.nih.gov/envgenom/home.html 



Docket No.: PT-I042 USN 
USSN: 10/009,416 
Ref.No.4 0F5 



T xicol gy 
Letters 

Toxicology Letters 112-113 (2000) 467-47 1 ===== 

www.elsevier.com/locate/toxlet 

Expression profiling in toxicology — potentials and 

limitations 

Sandra Steiner *, N. Leigh Anderson 

Large Scale Biology Corporation, 9620 Medical Center Drive, Rockville, MD 20850-3338, USA 



Abstract 

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact 
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific 
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints, 
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of 
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into 
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of 
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers. 
© 2000 Elsevier Science Ireland Ltd. All rights reserved. 
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1. Introduction 

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, result- 
ing in altered gene expression. Such gene expres- 
sion regulations account for both the 
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pharmacological action and the toxicity of a drug 
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist 
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control. 



If 
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2. Global mRNA profiling 

Expression data at the mRNA level can be 
produced using a set of different technologies 
such as DNA microarrays, reverse transcript 
imaging, amplified fragment length polymorphism 
(AFLP), serial analysis of gene expression 
(SAGE) and others. Currently, DNA microarrays 
are very popular and promise a great potential. 
On a typical array, each gene of interest is repre- 
sented either by a long DNA fragment (200-2400 
bp) typically generated by polymerase chain reac- 
tion (PCR) and spotted on a suitable substrate 
using robotics (Schena et al., 1995; Shalon et al., 
1996) or by several short oligonucleotides (20-30 
bp) synthesized directly onto a solid support using 
photolabile nucleotide chemistry (Fodor et al., 
1991; Chee et al., 1996). From control and treated 
tissues, total RNA or mRNA is isolated and 
reverse transcribed in the presence of radioactive 
or fluorescent labeled nucleotides, and the labeled 
probes are then hybridized to the arrays. The 
intensity of the array signal is measured for each 
gene transcript by either autoradiography or laser 
scanning confocal microscopy. The ratio between 
the signals of control and treated samples reflect 
the relative drug-induced change in transcript 
abundance. 

I 



3. Global protein profiling 

Global quantitative expression analysis at the 
protein level is currently restricted to the use of 
two-dimensional gel electrophoresis. This tech- 
nique combines separation of tissue proteins by 
isoelectric focusing in the first dimension and by 
sodium dodecyl sulfate slab gel electrophoresis- 
based molecular weight separation on the second, 
orthogonal dimension (Anderson et al., 1991). 
The product is a rectangular pattern of protein 
spots that are typically revealed by Coomassie 
Blue, silver or fluorescent staining (Fig. 2). 
Protein spots are identified by mass spectrometry 
following generation of peptide mass fingerprints 
(Mann et al., 1993) and sequence tags (Wilkins et 
al., 1996). Similar to the mRNA approach, the 
ratio between the optical density of spots from 
control and treated samples are compared to 
search for treatment-related changes. 

4. Expression data analysis 

Bioinformatics forms a key element required to 
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality 



Genomics 



Proteomics 



(DNA arrays, AFLP, RTI, SAGE ) 



(Two-D Electrophoresis) 



DNA • 



primary 
► RNA — 
transcript 



transcriptional 



mRNA 



+ 

mRNA 
transport 
control 

mRNA 



primary 



"mature" 



product 
translations 



protein 
maturation modificatlon(s) 
control e.g., PCM, etc 



degradation J f 



Fig. 1. Production of an active protein is a multistep process in which numerous regulation systems exert control at various stages 
of expression. Molecular fingerprints of drugs can be visualized through expression profiling at the mRNA level (genomics) using 
a variety of technologies and at the protein level (proteomics) using two-dimensional gel electrophoresis. 
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Fig. 2. Computerized representation of a Coomassie Blue stained two-dimensional gel electrophoresis pattern of Fischer F344 rat 
liver homogenate. 



quantitative expression data has been collected, is 
to visualize complex patterns of gene expression 
changes, to detect pathways and sets of genes 
tightly correlated with treatment efficacy and toxi- 
city, and to compare the effects of different sets of 
treatment (Anderson et al., 1996). As the drug 
effect database is growing, one may detect similar- 
ities and differences between the molecular finger- 
prints produced by various drugs, information 
that may be crucial to make a decision whether to 
refocus or extend the therapeutic spectrum of a 
drug candidate. 



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data 
obtained by mRNA and protein expression analy- 
sis. Low abundant transcripts may not be easily 
quantified at the protein level using standard two- 
dimensional gel electrophoresis analysis and their 
detection may require prefractionation of sam- 
ples. The expression of such genes may be prefer- 
ably quantified at the mRNA level using 
techniques allowing PCR-mediated target amplifi- 
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cation. Tissue biopsy samples typically yield good 
quality of both mRNA and proteins; however, the 
quality of mRNA isolated from body fluids is 
often poor due to the faster degradation of 
mRNA when compared with proteins. RNA sam- 
ples from body fluids such as serum or urine are 
often not very 'meaningful', and secreted proteins 
are likely more reliable surrogate markers for 
treatment efficacy and safety. Detection of post- 
translational modifications, events often related to 
function or nonfunction of a protein, is restricted 
to protein expression analysis and rarely can be 
predicted by mRNA profiling. Information on 
subcellular localization and translocation of 
proteins has to be acquired at the level of the 
protein in combination with sample prefractiona- 
tion procedures. The growing evidence of a poor 
correlation between mRNA and protein abun- 
dance (Anderson and Seilhamer, 1997) further 
suggests that the two approaches, mRNA and 
protein profiling, are complementary and should 
be applied in parallel. 

6. Expression profiling and drug development 

Understanding the mechanisms of action and 
toxicity, and being able to monitor treatment 
efficacy and safety during trials is crucial for the 
successful development of a drug. Mechanistic 
insights are essential for the interpretation of drug 
effects and enhance the chances of recognizing 
potential species specificities contributing to an 
improved risk profile in humans (Richardson et 
al, 1993; Steiner et al., 1996b; Aicher et al., 1998). 
The value of expression profiling further increases 
when links between treatment-induced expression 
profiles and specific pharmacological and toxic 
endpoints are established (Anderson et al., 1991, 
1995, 1996; Steiner et al. 1996a). Changes in gene 
expression are known to precede the manifesta- 
tion of morphological alterations, giving expres- 
sion profiling a great potential for early 
compound screening, enabling one to select drug 
candidates with wide therapeutic windows 
reflected by molecular fingerprints indicative of 
high pharmacological potency and low toxicity 
(Arce et al., 1998). In later phases of drug devel- 



opment, surrogate markers of treatment efficacy 
and toxicity can be applied to optimize the moni- 
toring of pre-clinical and clinical studies (Doherty 
et al., 1998). 



7. Perspectives 

The basic methodology of safety evaluation has 
changed little during the past decades. Toxicity in 
laboratory animals has been evaluated primarily 
by using hematological, clinical chemistry and 
histological parameters as indicators of organ 
damage. The rapid progress in genomics and pro- 
teomics technologies creates a unique opportunity 
to dramatically improve the predictive power of 
safety assessment and to accelerate the drug devel- 
opment process. Application of gene and protein 
expression profiling promises to improve lead se- 
lection, resulting in the development of drug can- 
didates with higher efficacy and lower toxicity. 
The identification of biologically relevant surro- 
gate markers correlated with treatment efficacy 
and safety bears a great potential to optimize the 
monitoring of pre-clinical and clinical trails. 
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ABSTRACT Pairwise sequence comparison methods have 
been assessed using proteins whose relationships are known 
reliably from their structures and functions, as described in 
the scop database [Murzin, A. G., Brenner, S. E., Hubbard, T. 
& Chothia C. (1995) /. Mol. BioL 247, 536-540]. The evalua- 
tion tested the programs BLAST [Altschul, S. F., Gish, W., 
Miller, W., Myers, E. W. & Lipman, D. J. (1990)./. Mol. BioL 
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) 
Methods EnzymoL 266, 460-480], FASTA [Pearson, W. R. & 
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], 
and ssearch [Smith, T. F. & Waterman, M. S. (1981) /. Mol. 
Biol. 147, 195-197] and their scoring schemes. The error rate 
of all algorithms is greatly reduced by using statistical scores 
to evaluate matches rather than percentage identity or raw 
scores. The E-value statistical scores of SSEARCH and fasta are 
reliable: the number of false positives found in our tests agrees 
well with the scores reported. However, the P- values reported 
by BLAST and WU-BLAST2 exaggerate significance by orders of 
magnitude, ssearch, fasta ktup = 1, and WU-BLAST2 perform 
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 
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Sequence comparison methodologies have evolved rapidly, 
so no previously published tests has evaluated modern versions 
of programs commonly used. For example, parameters in 
BLAST (1) have changed, and WU-BLAST2 (2) — which produces 
gapped alignments — has become available. The latest version 
of FASTA (3) previously tested was 1.6, but the current release 
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 

ptt'trrtn ii/QftrO Tiiot it> lift* fit f*"0/»ttQ« r\f KQtr»Qlrjr»QllC prQt*»i«c 

can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
scop: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The SCOP database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in ssearch (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided BLAST (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is fasta (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
pir database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR 
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superfamilies. Pearson found that modern matrices and "In- 
scaling" of raw scores improve results considerably. He also 
reported that the rigorous Smith-Waterman algorithm worked 
slightly better than FASTA, which was in turn more effective 
than BLAST. 

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of blast and fasta. Their test with blast 
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the swiss-prot database (12) and used prosite (13) 
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com- 
parison, the correct results were effectively unknown. This is 
because the superfamilies in pir and prosite are principally 
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but PIR places them in different superfamilies. 
The problem is widespread: each superfamily in pir 48.00 with 
a structural homolog is itself homologous to an average of 1.6 
other pir superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the hssp equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the 
Karlin and Altschul statistics (22, 23) and empirical ap- 
proaches have been recently added to fasta and ssearch. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical trac- 
tability of statistical scores "is a crucial feature of the blast 
algorithm" (1). The validity of this scoring procedure has been 
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 
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is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the SCOP database (4, 5) have allowed us to overcome previous 
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From scop, we extracted the sequences of domains of 
proteins in the Protein Data Bank (PDB) (30) and created two 
databases. One (PDB90D-B) has domains, which were all <90% 
identical to any other, whereas (PDB40D-B) had those <40% 
identical. The databases were created by first sorting all 
protein domains in scop by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of 
distant relationships, or ^0.5% of the total 1,749,006 ordered 
pairs. In PDB90D-B, the 2,079 domains have 53,988 relation- 
ships, representing 1.2% of all pairs. Low complexity regions 
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the seg program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/ 
sss/, and databases derived from the current version of scop 
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the pdb of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences) 
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested blast (1), version 1.4.9MP, and wu- 
BLAST2 (2), version 2.0al3MP. Also assessed was the fasta 
package, version 3.0t76 (3), which provided FASTA and the 
SSEARCH implementation of Smith-Waterman (8). For 
SSEARCH and FASTA, we used BLOSUM45 with gap penalties 
— 12/— 1 (7, 16). The default parameters and matrix (blo- 
SUM62) were used for BLAST and WU-BLAST2. 

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 
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Coverage Coverage 

Fig. 1. Coverage vs. error plots of different scoring schemes for ssearch Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis 
of PDB90D-B database. All of the proteins in the database were compared with each other using the ssearch program. The results of this single 
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ) 
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of 
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the 
same fold divided by the total number of pairs from a common super family. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates 
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all 
comparison, 13 errors corresponds to 0.01, or 1% EPQ. They axis is presented on a log scale to show results over the widely varying degrees of 
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph 
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving 
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without 
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within 
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues 
in the aligned region as a percentage of the average length of the query and target proteins. The hssp equation (17) is H = 290.15/ -0 - 562 where 
/ is length for 10 *t / *t 80; H 100 for / *t 10; H = 24.7 for / ^* SO. Ths percentage identity Hssp-sdjusted score is the percent identity within 
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program. 



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how 
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Fig. 2. Unrelated proteins with high percentage identity. Hemo- 
globin /5-chain (pdb code lhds chain b, ref. 38, Left) and cellulase E2 
(pdb code ltml, ref. 39, Right) have 39% identity over 64 residues, a 
level which is often believed to be indicative of homology. Despite this 
high degree of identity, their structures strongly suggest that these 
proteins are not related. Appropriately, neither the raw alignment 
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by 

RASMOL (40). 



protocols compare at different levels of accuracy. These 
graphs share effectively all of the beneficial features of Re- 
ciever Operating Characteristic (ROC) plots (33, 34) but 
better represent the high degrees of accuracy required in 
sequence comparison and the huge background of nonho- 
mologs. 

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 
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Fig. 3. Length and percentage identity of alignments of unrelated 
proteins in PDB90D-B: Each pair of nonhomologous proteins found with 
ssearch is plotted as a point whose position indicates the length and 
the percentage identity within the alignment. Because alignment 
length and percentage identity are quantized, many pairs of proteins 
may have exactly the same alignment length and percentage identity. 
The line shows the hssp threshold (though it is intended to be applied 
with a different matrix and parameters). 
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Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows 
the relationship between reported statistical score and actual error 
rate for a different program. E-values are reported for ssearch and 
fast A, whereas P-values are shown for blast and wu-blast2. If the 
scoring were perfect, then the number of errors per query and the 
E-values would be the same, as indicated by the upper bold line. 
(P-values should be the same as EPQ for small numbers, and diverges 
at higher values, as indicated by the lower bold line.) E-values from 
ssearch and fasta are shown to have good agreement with EPQ but 
underestimate the significance slightly, blast and WU-BLAST2 are 
overconfident, with the degree of exaggeration dependent upon the 
score. The results for PDB40D-B were similar to those for PDB90D-B 
despite the difference in number of homologs detected. This graph 
could be used to roughly calibrate the reliability of a given statistical 
score. 

ignored in previous tests but is essential for the straightforward 
or automatic interpretation of sequence comparison results. 
Further, it provides a clear indication of the confidence that 
should be ascribed to each match. Indeed, the EPQ measure 
should approximate the expectation value reported by data- 
base searching programs, if the programs' estimates are accu- 
rate. 

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In blast, a measure 
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related to this score is scaled into bits. Third is a statistical 
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the hssp equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no 
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 



Sequence Comparison Algorithms (PDB90D-B) 




Coverage 



Fig. 5. Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each 
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow ssearch, which finds 18% of relationships 
at 1 % EPQ. fasta ktup = 1 and wu-blast2 are almost as good. (B) PDB90D-B database. The quick wu-blasT2 program provides the best coverage 
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than fasta ktup = 1 and ssearch. 
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likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also 
easy to interpret, ssearch and fasta show close agreement 
between statistical scores and actual number of errors per 
query (Fig. 4). The expectation value score gives a good, 
slightly conservative estimate of the chances of the two se- 
quences being found at random in a given query. Thus, an 
E-value of 0.01 indicates that roughly one pair of nonhomologs 
of this similarity should be found in every 100 different queries. 
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 

The P-values from BLAST also should be directly interpret- 
able but were found to overstate significance by more than two 
orders of magnitude for 1% EPQ for this database. Nonethe- 
less, these results strongly suggest that the analytic theory is 
fundamentally appropriate. WU-BLAST2 scores were more re- 
liable than those from BLAST, but also exaggerate expected 
confidence by more than an order of magnitude at 1% EPQ. 

Overall Detection of Homologs and Comparison of Algo- 
rithms. The results in Fig. 5A and Table 1 show that pairwise 
sequence comparison is capable of identifying only a small 
fraction of the homologous pairs of sequences in PDB40D-B. 
Even SSEARCH with E-values, the best protocol tested, could 
find only 18% of all relationships at a 1% EPQ. blast, which 
identifies 15% wns the worst performer, where 2.S FASTA 
ktup = 1 is nearly as effective as ssearch. fasta ktup = 2 and 
WU-BLAST2 are intermediate in their ability to detect ho- 
mologs. Comparison of different algorithms indicates that 
those capable of identifying more homologs are generally 
slower. SSEARCH is 25 times slower than blast and 6.5 times 
slower than fasta ktup = 1. WU-BLAST2 is slightly faster than 
fasta ktup = 2, but the latter has more interpretable scores. 

In PDB90D-B, where there are many close relationships, the 
best method can identify only 38% of structurally known 
homologs (Fig. 5B). The method which finds that many 
relationships is WU-BLAST2. Consequently, we infer that the 
differences between fasta kup = 1, ssearch, and WU-BLAST2 
programs are unlikely to be significant when compared with 
variation in database composition and scoring reliability. 

Fig. 6 helps to explain why most distant homologs cannot be 
found by sequence comparison: a great many such relation- 
ships have no more sequence identity than would be expected 
by chance. SSEARCH with E-values can recognize >90% of the 
homologous pairs with 30-40% identity. In this region, there 
are 30 pairs of homologous proteins that do not have signif- 
icant E-values, but 26 of these involve sequences with <50 
residues. Of sequences having 25-30% identity, 75% are 
identified by ssearch E-values. However, although the num- 
ber of homologs grows at lower levels of identity, the detection 
falls off sharply: only 40% of homologs with 20-25% identity 
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Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars 
show the distribution of homologous pairs pdbwd-b according to their 
identity (using the measure of identity in both). Filled regions indicate 
the number of these pairs found by the best database searching method 
(ssearch with E-values) at 1% EPQ. The PDB40D-B database contains 
proteins with <40% identity, and as shown on this graph, most 
structurally identified homologs in the database have diverged ex- 
tremely far in sequence and have <20% identity. Note that the 
alignments may be inaccurate, especially at low levels of identity. Filled 
regions show that ssearch can identify most relationships that have 
25% or more identity, but its detection wanes sharply below 25%. 
Consequently, the great sequence divergence of most structurally 
identified evolutionary relationships effectively defeats the ability of 
pariwise sequence comparison to detect them. 

are detected and only 10% of those with 15-20% can be found. 
These results show that statistical scores can find related 
proteins whose identity is remarkably low; however, the power 
of the method is restricted by the great divergence of many 
protein sequences. 

After completion of this work, a new version of pairwise 
blast was released: BLASTGP (37). It supports gapped align- 
ments, like WU-BLAST2, and dispenses with sum statistics. Our 
initial tests on BLASTGP using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped blast, but not 
quite equal to that of WU-BLAST2. 

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se- 
quence searches are made by (/) using a large current database 
in which the protein sequences have been complexity masked 
and («) using statistical scores to interpret the results. Our 
experiments fully support this view. 

Our results also suggest two further points. First, the E-val- 
ues reported by FASTA and SSEARCH give fairly accurate 
estimates of the significance of each match, but the P-values 
provided by blast and WU-B1AST2 underestimate the true 



Table 1. Summary of sequence comparison methods with PDB40D-B 



Method 


Relative Time* 


1% EPQ Cutoff 


Coverage at 1 


ssearch % identity: within alignment 


25.5 


>70% 


<0.1 


ssearch % identity: within both 


25.5 


34% 


3.0 


ssearch % identity: HSSP-scaled 


25.5 


35% (hssp + 9.8) 


4.0 


ssearch Smith- Waterman raw scores 


25.5 


142 


10.5 


ssearch E-values 


25.5 


0.03 


18.4 


fasta ktup = 1 E-values 


3.9 


0.03 


17.9 


fasta ktup = 2 E-values 


1.4 


0.03 


16.7 


WU-BLAST2 P-values 


1.1 


0.003 


17.5 


blast P-values 


1.0 


0.00016 


14.8 



♦Times are from large database searches with genome proteins. 
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extent of errors. Second, ssearch, WU-BLAST2, and fasta 
ktup = 1 perform best, though BLAST and fasta ktup = 2 
detect most of the relationships found by the best procedures 
and are appropriate for rapid initial searches. 

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



** Additional and updated information about this work, including 
supplementary figures, may be found at http://sss.stanford.edu/sss/. 
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